Prognosis of Lung Cancer Using Data Mining Techniques

Similar documents
FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

Performance Analysis of Data Mining Classification Techniques

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

Short Survey on Static Hand Gesture Recognition

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017

BENCHMARKING ATTRIBUTE SELECTION TECHNIQUES FOR MICROARRAY DATA

A STUDY OF SOME DATA MINING CLASSIFICATION TECHNIQUES

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Filter methods for feature selection. A comparative study

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process

A Heart Disease Risk Prediction System Based On Novel Technique Stratified Sampling

Chapter 1, Introduction

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

The Role of Biomedical Dataset in Classification

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan

Image Mining: frameworks and techniques

Tumor Detection and classification of Medical MRI UsingAdvance ROIPropANN Algorithm

A Performance Assessment on Various Data mining Tool Using Support Vector Machine

Gene Clustering & Classification

Performance Evaluation of Various Classification Algorithms

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

A Novel Feature Selection Framework for Automatic Web Page Classification

International Journal of Advanced Research in Computer Science and Software Engineering

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

Classification and Optimization using RF and Genetic Algorithm

An Empirical Study on feature selection for Data Classification

Improving Classifier Performance by Imputing Missing Values using Discretization Method

Machine Learning Techniques for Data Mining

International Journal of Advance Research in Engineering, Science & Technology

Chapter 8 The C 4.5*stat algorithm

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Slides for Data Mining by I. H. Witten and E. Frank

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Weighting and selection of features.

Analysis of Feature Selection Techniques: A Data Mining Approach

Comparative analysis of classifier algorithm in data mining Aikjot Kaur Narula#, Dr.Raman Maini*

A Comparative Study of Selected Classification Algorithms of Data Mining

Basic Data Mining Technique

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16

SOCIAL MEDIA MINING. Data Mining Essentials

Rank Measures for Ordering

Intrusion Detection Using Data Mining Technique (Classification)

CLASSIFICATION OF C4.5 AND CART ALGORITHMS USING DECISION TREE METHOD

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Weighted Heuristic Ensemble of Filters

ISSN ICIRET-2014

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 1

Correlation Based Feature Selection with Irrelevant Feature Removal

User Guide Written By Yasser EL-Manzalawy

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

Name of the lecturer Doç. Dr. Selma Ayşe ÖZEL

SK International Journal of Multidisciplinary Research Hub Research Article / Survey Paper / Case Study Published By: SK Publisher

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

A REVIEW ON VARIOUS APPROACHES OF CLUSTERING IN DATA MINING

Optimizing Completion Techniques with Data Mining

Parametric Comparisons of Classification Techniques in Data Mining Applications

Improving Imputation Accuracy in Ordinal Data Using Classification

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat

Tap Position Inference on Smart Phones

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

Bagging-Based Logistic Regression With Spark: A Medical Data Mining Method

Efficiently Handling Feature Redundancy in High-Dimensional Data

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

Supervised Learning Classification Algorithms Comparison

Data Engineering. Data preprocessing and transformation

Keyword Extraction by KNN considering Similarity among Features

Forward Feature Selection Using Residual Mutual Information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

INTRODUCTION... 2 FEATURES OF DARWIN... 4 SPECIAL FEATURES OF DARWIN LATEST FEATURES OF DARWIN STRENGTHS & LIMITATIONS OF DARWIN...

Link Prediction for Social Network

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Feature-weighted k-nearest Neighbor Classifier

A SURVEY OF DATA MINING & ITS APPLICATIONS

An Expert System for Detection of Breast Cancer Using Data Preprocessing and Bayesian Network

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

KNIME Enalos+ Molecular Descriptor nodes

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

A Novel Algorithm for Associative Classification

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

Discovery of Agricultural Patterns Using Parallel Hybrid Clustering Paradigm

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

Overview. Non-Parametrics Models Definitions KNN. Ensemble Methods Definitions, Examples Random Forests. Clustering. k-means Clustering 2 / 8

CS570: Introduction to Data Mining

A Survey of Statistical Modeling Tools

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

Sensor Based Time Series Classification of Body Movement

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall

Transcription:

Prognosis of Lung Cancer Using Data Mining Techniques 1 C. Saranya, M.Phil, Research Scholar, Dr.M.G.R.Chockalingam Arts College, Arni 2 K. R. Dillirani, Associate Professor, Department of Computer Science, Dr.M.G.R.Chockalingam Arts College. Abstract Data mining is the extraction of hidden predictive information and unknown data, patterns, relationships and knowledge by exploring the large data sets which are difficult to find and detect with traditional statistical methods. Data mining it is powerful technology which will discover most important information from the data warehouse of the organizations. It is a very crucial step that collectively examine large amount of routinely data. To find latest patterns in healthcare industry, there exist various interactive and scalable data mining methods. Data mining is a quantitative approach which is user friendly in reading reports and reducing errors and controls the quality more uniformly. Important task of data mining is data preprocessing. Data mining tools are used for decision making. Prediction and classification techniques are used in which classification technique predicts the unknown values with respect to generated model. An assortment of data mining techniques can be applied to find associations and regularities in data, extract knowledge in the forms of rules and predict the value of the dependent variables. Keywords: Data mining, Error, Generated model. 1. INTRODUCTION: In this thesis we propose an ensemble approach for feature selection, where multiple feature selection techniques are combined to yield more robust and stable results. Ensemble of multiple feature ranking techniques is performed in two steps. The first step involves creating a set of different feature selectors, each providing its sorted order of features, while the second step aggregates the results of all feature ranking techniques. The ensemble method used in our study is frequency count which is accompanied by mean to resolve any frequency count collision.data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships from large amounts of data stored in databases, data warehouses, or other information repositories. Data mining has two approaches. The first approach tries to produce an overall summary of a set of data to identify and describe main features. The second approach, pattern detection, seeks to identify small unusual patterns of behavior. The data mining analysis tasks typically fall into the following categories: data summarization, segmentation, classification, prediction, dependency analysis. One of the models is CRISP-DM [1]. It is a De Facto standard for industry. The CRISP- DM project began in mid-1997 to define and validate an industry and tool-neutral data mining process model. The six steps developed in this model are: business understanding, data understanding, data preparation, modeling, evaluation and deployment. Business understanding is the phase of understanding objectives and requirements of a project. Data understanding is the phase of becoming familiar with the data like identifying data quality problems, discover first insights into data. Data preparation phase describes the entire activities essential in constructing a final dataset from raw data. In the modeling phase 2277 2017 C. Saranya.al. http://www.irjaet.com

various modeling techniques are selected and applied to the model. Evaluation is the phase where the project is thoroughly evaluated before the final deployment. Deployment is the phase where the knowledge discovered will be organized and presented in a way a client can use. Therefore, it grasp various scientific disciplines: from mathematics and statistics to biology and genetics, each of which uses different terms to describe the topologies formed using this analysis. From biological taxonomies, to medical syndromes and genetic genotypes to manufacturing group technology the problem is identical: forming categories of entities and assigning individuals to the proper groups within it. Following are the various clustering algorithms used in healthcare. 2. RELATED WORK This section provides a brief coverage of the works performed in the area of ensemble feature ranking. These works assess how an ensemble of feature ranking techniques can improve robustness, performance and diversity. Feature ranking is a process of selecting the most relevant features from a large set of features. It is considered as one of the most critical problems researchers face today in data mining and machine learning. The main focus of ensemble feature ranking approach is on improving classification performance through the combination of feature ranking techniques. Very limited research exists on ensemble feature ranking. Early studies on ensemble of feature ranking techniques were performed by Rokach et al. [28]. The experiments in this study are performed to check whether ensemble of feature subsets improve classification accuracy over individual rankers. The experiments are performed on datasets obtained from UCI machine learning repository. Five different feature selection algorithms were used to generate 10 ensembles. The combining methods used for ensemble are: majority voting, take-it-all, smaller is heavier. The ensembles were evaluated using C4.5 classification model. The experimental results have shown that ensemble method performed better than individual feature rankers. Olsson and Oard [30] performed a study on combining feature selectors for text classification. The experiments were performed on two sets containing 23, 149 documents and 200,000 documents from RCV1-v2. The documents were combined using document frequency thresholding, information gain and the chi-square feature selection methods. The combination methods used are highest rank, lowest rank and average rank combination. The documents were classified using k-nearest neighbours with k=100. The evaluation criteria used for this study was R-precision. Wilker et al. [6] performed a study using six standard and eleven threshold based filter based feature ranking techniques. In this study six ensemble approaches were considered based on standard and threshold based filters. In addition, four other ensemble approaches were developed based on their robustness to class noise. This study used seven datasets from different domain applications, with different dimensions and different level of class imbalance. This work was evaluated on binary classification datasets. The experimental results showed that ensemble robustness can be predicated from the knowledge of individual components. Cancer is potentially fatal disease. Detecting cancer is still challenging for the doctors in the field of medicine. Even now the actual reason and complete cure of cancer is not invented. Detection of cancer in 2278 2017 C. Saranya.al. http://www.irjaet.com

earlier stage is curable. In this work we have developed a system called data mining based cancer prediction system. The main aim of this model is to provide the earlier warning to the users and it is also cost and time saving benefit to the user. It predicts three specific cancer risks. Specifically, Cancer prediction system estimates the risk of the breast, skin, and lung cancers by examining a number of userprovided genetic and non-genetic factors. This system is validated by comparing its predicted results with the patient s prior medical record and also this is analyzed using weka system. 3. PROCESS Data mining is the process of automatically collecting large volumes of data with the objective of finding hidden patterns and analyzing the relationships between numerous types of data to develop predictive models. we use the classification techniques. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large. Dataset used in this model should be more precise and accurate in order to improve the predictive accuracy of data mining algorithms. Which is collected may have missing (or) irrelevant attributes. Fig.1. Analysis Data mining can be used to help predict future patient behavior and to improve treatment programs. By identifying high-risk patients, clinicians can better manage the care of patients today so that they do not become the problems of tomorrow. For instance, early stage lung and oral cancers are very hard to diagnose by conventional means; genomic signatures can be used to provide more timely and perhaps more accurate diagnosis [2]. Data mining refers to extracting or mining knowledge from large amounts of data. It is an increasingly popular field that uses statistical, visualization, machine learning, and other data manipulation and knowledge extraction techniques aimed at gaining an insight into the relationships and patterns hidden 2279 2017 C. Saranya.al. http://www.irjaet.com

in the data [3]. In principle, data mining should be applicable to any kind of information repository. This includes relational databases, data warehouses, transactional databases, advanced database systems, and the World-Wide Web. Advanced database systems include object-oriented, object-relational databases, and specific application-oriented databases, such as biological database, spatial databases, time-series databases, text databases, and multimedia databases 4. ANALYSIS We also perform ANOVA test on the AUC performance metric. ANOVA is an acronym for Analysis of Variance. It is defined as a procedure for assigning sample variance to different sources and making a decision if the variation is within or among different population groups. Samples are described in terms of variation around group means and variation of group means around an overall mean. If variations within groups are small relative to variations between groups, a difference in group means may be inferred. Hypothesis Tests are used to quantify decisions. 1 0.9 0.8 0.7 0.6 0.5 0.4 GR RFF SU OneR IG Ensemble Base Dataset KNN C4.5 NB RF LR SVM Fig.2. Waveform output In this notation parameters with two subscripts, such as (αβ) ij., represent the interaction effect of two factors. The parameter (αβγ) ijk represents the three-way interaction. An ANOVA model can have the full set of parameters or any subset, but conventionally it does not include complex interaction terms unless it also includes all simpler terms for those factors. CONCLUSION In this thesis, we have reviewed feature selection and explained the basic concept of different feature selection methods: filter, wrapper and hybrid model. We reviewed four filter based feature ranking techniques and one wrapper based feature ranking technique. They are information gain, gain ratio, symmetrical uncertainty, relieff and onerattribute evaluation. We examined classification models that are built using various classification techniques such as naïve bayes, k-nearest neighbor, Bayesian Network, support vector machine, Logistic Regression (J48) and decision trees. We took a brief review of the 2280 2017 C. Saranya.al. http://www.irjaet.com

evaluation criteria used to evaluate the classification models. We have also introduced ensemble methods for feature ranking technique that can help build stable and robust classification models. REFERENCES [1] J. Jackson, Data Mining: A Conceptual Overview, Communications of the Association for Information Systems (Volume 8, 2002), pages 267-296. [2] H. Wang, T. M. Khoshgoftaar, K. Gao, Ensemble feature selection technique for software quality classification, Proceedings of the 22nd International Conference on Software Engineering & Knowledge Engineering, Redwood City, San Francisco Bay, CA, USA, July 1 - July 3, 2010, pages 215-220. [3] H. Wang, T. M. Khoshgoftaar, A. Napolitano, A comparative study of ensemble feature selection techniques for software defect prediction, Proceedings of the Ninth International Conference on Machine Learning and Applications, Washington, DC, USA, December 12-14, 2010, pages 135-140. [4] H. Liu, H. Motoda, R. Setiono, Z. Zhao, Feature Selection: An Ever Evolving Frontier in Data Mining, JMLR: Workshop and Conference Proceedings 2010, Volume: 4, Publisher: Citeseer, pages 4-13. [5] L. Yu, H. Liu, Feature Selection for High-Dimensional Data: A Fast Correlation- Based Filter Solution, Proceedings of the Twentieth International Conference on Machine Leaning, ICML-03, Washington, D.C., August, 2003, pages 856-863. [6] W. Altidor, T. M. Khoshgoftaar, J. Van Hulse, A. Napolitano, Ensemble feature ranking methods for data intensive computing applications, Handbook of data intensive computing, Springer Science + Business media, LLC 2011, pages 349-376. 2281 2017 C. Saranya.al. http://www.irjaet.com