Seminars of Software and Services for the Information Society


DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI
Master of Science in Engineering in Computer Science (MSE-CS)
Seminars in Software and Services for the Information Society
Umberto Nanni

Data Mining for evaluating the risk of chemotherapy-associated thrombosis
Lara Malfatti (MD Thesis, March 2013; Advisor: Umberto Nanni)

Outline
- Problem and contextualization
- Data Mining methodologies
- Dataset preprocessing
- Attributes selection
- Classification
- Costs evaluation
- Conclusion

Venous Thrombo-Embolism (VTE)
- Its incidence increases from 0.1% in the general population to 3% in cancer patients
- It is the second leading cause of death in cancer patients
- Its treatment represents a high cost for the National Health Service (about 8,000 per patient)

Data set description
The dataset contains 565 instances (526 negative + 39 positive). Each entry contains 35 variables, which can be grouped into:
1. Patient risk factors: such as age, sex, laboratory analyses and comorbid conditions (e.g. obesity)
2. Cancer risk factors: such as site and stage of the tumor
3. Treatment risk factors: such as administration of chemotherapy or targeted therapy agents

State of the art

Terminology
- Classification process: takes an instance as input and tries to forecast whether it will be positive or negative
- Medical evaluation metrics (accuracy, PPV, NPV) are derived from the related confusion matrix
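The relation between the confusion matrix and the metrics used in the rest of the deck (accuracy, PPV, NPV) can be sketched as follows. This is a minimal illustration using the standard definitions; the class name, the helper `ratio` and the example counts are assumptions made for the sketch, not values from the thesis.

```python
from dataclasses import dataclass


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # positive instances predicted positive
    fp: int  # negative instances predicted positive
    tn: int  # negative instances predicted negative
    fn: int  # positive instances predicted negative

    def accuracy(self) -> float:
        # Share of all instances that were classified correctly.
        return ratio(self.tp + self.tn, self.tp + self.fp + self.tn + self.fn)

    def ppv(self) -> float:
        # Positive predictive value: how often a positive prediction is right.
        return ratio(self.tp, self.tp + self.fp)

    def npv(self) -> float:
        # Negative predictive value: how often a negative prediction is right.
        return ratio(self.tn, self.tn + self.fn)


# Hypothetical counts, for illustration only (not results from the thesis).
example = ConfusionMatrix(tp=30, fp=10, tn=50, fn=10)
print(f"accuracy={example.accuracy():.2f} "
      f"ppv={example.ppv():.2f} npv={example.npv():.2f}")
```

Returning NaN rather than raising keeps the sketch usable for a classifier that never predicts one of the two classes.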

Statistical approach: Khorana's score
This model uses 5 biological variables as predictors and classifies patients into three risk categories: low, intermediate and high risk.

Risk category    Num. of patients
Low              280
Intermediate     252
High             33

Metric      Value
Accuracy    53%
PPV         10%
NPV         96%

Pros:
- Simple and clear model
- Low cost of predictive variables
Cons:
- Too many patients classified as intermediate risk
- Poor performance

Challenge:
- Is it possible to find better variable combinations able to predict thrombosis through data mining?
- What is the best predictive combination in terms of cost/benefit among all the possible ones?
- Are the screening costs of these combinations sustainable for the National Health Service?

Knowledge Discovery in Health Care

WEKA
- WEKA: Waikato Environment for Knowledge Analysis
- It is a free tool for data mining applications, written in Java
- It implements all the steps of the KDD workflow, from data preprocessing to the visualization of discovered patterns
- Here, attention is focused on data preprocessing, attribute selection and the learning phase

WEKA: learning phase
- Learning phase: training and testing datasets must be disjoint
- An unbalanced dataset causes:
  - Excessive influence of the majority class on the classification model
  - High global performance without correctly forecasting a single instance of the minority class
- Balanced training and testing datasets are created manually during the preprocessing phase
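The second effect of class imbalance can be made concrete with the class counts reported for this dataset (526 negative, 39 positive). The sketch below scores a trivial classifier that always answers "negative"; the variable names are illustrative.

```python
# Class counts taken from the data set description slide.
NEGATIVES, POSITIVES = 526, 39

labels = [0] * NEGATIVES + [1] * POSITIVES
predictions = [0] * len(labels)  # trivial model: always predict the majority class

correct = sum(1 for p, y in zip(predictions, labels) if p == y)
accuracy = correct / len(labels)

# Number of positive (minority) instances that the model actually detects.
detected_positives = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(f"accuracy = {accuracy:.1%}")                 # accuracy = 93.1%
print(f"detected positives = {detected_positives}")  # detected positives = 0
```

The global accuracy is high only because 526 of the 565 instances are negative; not a single thrombosis case is forecast.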

Data set pre-processing: cleaning (1/3)
Create three balanced folds and combine the partial results:
- All the instances are classified exactly once
- All the training sets have the same number of positive and negative instances
- Training and testing datasets are disjoint
- Extra cost: each experiment needs three runs
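One way to obtain folds with these properties is sketched below. The slides do not say how the balanced sets were built, so the approach here is an assumption: the instances are split into three disjoint test parts, and each training set keeps the positives of the other two parts plus an equally sized random sample of their negatives. The function name and the instance identifiers are illustrative.

```python
import random


def balanced_folds(positives, negatives, k=3, seed=0):
    """Return k (train, test) pairs; each train set is class-balanced by undersampling."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Disjoint parts: every instance lands in exactly one test set.
    pos_parts = [pos[i::k] for i in range(k)]
    neg_parts = [neg[i::k] for i in range(k)]

    folds = []
    for i in range(k):
        test = pos_parts[i] + neg_parts[i]
        train_pos = [x for j in range(k) if j != i for x in pos_parts[j]]
        neg_pool = [x for j in range(k) if j != i for x in neg_parts[j]]
        # Undersample the majority class so both classes have the same size.
        train_neg = rng.sample(neg_pool, len(train_pos))
        folds.append((train_pos + train_neg, test))
    return folds


# Instance identifiers mirroring the dataset size (39 positive, 526 negative).
pos_ids = [("pos", i) for i in range(39)]
neg_ids = [("neg", i) for i in range(526)]

for train, test in balanced_folds(pos_ids, neg_ids):
    print(len(train), len(test))
```

Each of the three runs trains on one balanced set and tests on the held-out part; combining the three test results gives one prediction per instance.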

Data set pre-processing: cleaning (2/3)
- The objective is to remove noisy instances
- VTE normally occurs within 6 months from the beginning of chemotherapy
- The time interval is enlarged to 12 months to also cover asymptomatic events
- Outliers are given by:
  - Intrinsic probability of having a thrombotic event
  - Changes in anticancer treatments

Data set preprocessing: improvements (3/3)
- Unstructured numerical data are aggregated so that they do not distort the classification model (see figure)
- Instances with missing values are discarded because:
  - Artificial values may not correspond to real cases
  - They can create problems in both the training and the testing datasets
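Both preprocessing steps can be sketched in a few lines. The attribute names, the cut points and the band labels below are hypothetical, chosen only to show the mechanics; the slides do not list the actual aggregation rules.

```python
from bisect import bisect_right


def drop_incomplete(records):
    """Keep only the records in which every attribute has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


def to_band(value, cut_points, labels):
    """Map a numeric value to the label of the interval it falls in."""
    return labels[bisect_right(cut_points, value)]


# Hypothetical records; None marks a missing value.
records = [
    {"age": 71, "platelets": 360},
    {"age": None, "platelets": 200},
    {"age": 45, "platelets": 150},
]

# Hypothetical aggregation of age into three bands.
AGE_CUTS = [50, 65]
AGE_LABELS = ["<50", "50-64", ">=65"]

complete = drop_incomplete(records)
for record in complete:
    record["age_band"] = to_band(record["age"], AGE_CUTS, AGE_LABELS)

print(complete)
```

Discarding incomplete instances, rather than imputing values, follows the rationale on the slide: no artificial value enters the training or testing data.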

Attribute selection (1/2)
- Feature selection returns meaningful subsets of the original attributes, ignoring those that provide no information
- Filter methods:
  - are independent of any learning algorithm and rely only on data properties
  - can be seen as the combination of a search technique for proposing new subsets and an evaluation metric to rank them
- WEKA provides many options for both

Attribute selection (2/2)
- GreedyStepwise: performs a greedy search through the space of attribute subsets, in either direction (forward or backward), starting here from the empty set
- CfsSubsetEval (correlation-based feature subset evaluator): prefers subsets whose attributes are highly correlated with the class but have low inter-correlation
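The combination of a greedy forward search with a correlation-based subset score can be sketched as follows. The merit used is the usual CFS heuristic, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the subset size, r_cf the mean attribute-class correlation and r_ff the mean attribute-attribute correlation. This is an illustration, not WEKA's implementation: it uses plain Pearson correlation on numeric data, and the function names and toy data are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def cfs_merit(columns, target, subset):
    """CFS merit: rewards class correlation, penalizes inter-correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward(columns, target, tol=1e-12):
    """Start from the empty set; add the best attribute while the merit improves."""
    selected, best = [], 0.0
    remaining = sorted(columns)
    while remaining:
        score, name = max((cfs_merit(columns, target, selected + [f]), f)
                          for f in remaining)
        if score <= best + tol:
            break
        selected.append(name)
        remaining.remove(name)
        best = score
    return selected, best


# Toy data: "a" tracks the class, "a_scaled" is redundant with "a", "noise" is unrelated.
target = [0, 0, 1, 1, 0, 1]
columns = {
    "a": [1, 2, 8, 9, 1, 9],
    "a_scaled": [2, 4, 16, 18, 2, 18],
    "noise": [5, 1, 4, 2, 3, 3],
}
selected, merit = greedy_forward(columns, target)
print(selected, round(merit, 3))
```

On the toy data only one of the two redundant attributes is kept: adding the second leaves the merit unchanged, and adding the unrelated attribute lowers it.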

Classification
Guidelines:
- For each subset found in the previous step, experiments are conducted using different learning algorithms
- PPV, NPV and accuracy are compared; Khorana's results are used as the benchmark
- A constraint is fixed: no NPV values lower than 96% are allowed
- WEKA provides a variety of learning algorithms; the ones used in the experiments are: Bayes algorithms, decision trees, covering rules, logistic regression functions and lazy algorithms
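The selection rule in these guidelines can be sketched as a filter followed by a ranking. The benchmark values are the ones reported for Khorana's score (accuracy 53%, PPV 10%, NPV 96%); the candidate names and their metric values are hypothetical placeholders, not results from the thesis, and ranking the admissible candidates by PPV is an assumption of this sketch.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float
    ppv: float
    npv: float


# Benchmark values reported on the Khorana's score slide.
KHORANA = Result("Khorana's score", accuracy=0.53, ppv=0.10, npv=0.96)
MIN_NPV = 0.96  # constraint stated in the guidelines


def admissible(results, min_npv=MIN_NPV):
    """Drop results that violate the NPV constraint; rank the rest by PPV."""
    kept = [r for r in results if r.npv >= min_npv]
    return sorted(kept, key=lambda r: r.ppv, reverse=True)


# Hypothetical candidates, for illustration only.
candidates = [
    Result("candidate A", accuracy=0.70, ppv=0.18, npv=0.97),
    Result("candidate B", accuracy=0.80, ppv=0.25, npv=0.94),  # violates the constraint
    Result("candidate C", accuracy=0.65, ppv=0.21, npv=0.96),
]

for r in admissible(candidates):
    gain = r.ppv / KHORANA.ppv
    print(f"{r.name}: PPV {r.ppv:.0%} ({gain:.1f}x benchmark), NPV {r.npv:.0%}")
```

Filtering on NPV first reflects the clinical priority: a model may only be preferred to the benchmark if a negative prediction remains at least as trustworthy.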

Classification: Accuracy
All the predictive groups have better accuracy than Pure-KS.

Classification: NPV
The Khorana group violates the NPV constraint: its NPV is under 96%.

Classification: PPV
The WEKA and ThP groups double the PPV obtained by Pure-KS.

Cost Evaluation (1/2)
Evaluation of the screening cost and of the possible NHS savings

Cost Evaluation (2/2)
- In all cases, the National Health Service saves money from correctly predicted thrombosis (no treatment needed) and covers the screening costs at the same time
- Augmented-KS is the best predictive combination from an economic point of view

Conclusion and future work
From the use of data mining for the study of chemotherapy-associated thrombosis:
- PPV increases by 150% with respect to the statistical approach
- The NHS saves money from correctly predicted thrombosis and covers the screening costs at the same time
Because the analyzed dataset is small, better results could be reached by:
- repeating the experiments with more biological variables
- repeating the experiments with more instances in the dataset