Seminars in Software and Services for the Information Society

1 DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI
Master of Science in Engineering in Computer Science (MSE-CS)
Seminars in Software and Services for the Information Society
Umberto Nanni
Lara Malfatti (MD Thesis, March 2013)
Data Mining for evaluating the risk of chemotherapy-associated thrombosis
Lara Malfatti - MD Thesis (Advisor: Umberto Nanni)

2 Outline
- Problem and contextualization
- Data Mining methodologies
- Dataset preprocessing
- Attributes selection
- Classification
- Costs evaluation
- Conclusion

3 Venous Thrombo-Embolism (VTE)
- Its incidence rises from 0.1% in the general population to 3% in cancer patients
- It is the second leading cause of mortality in cancer patients
- Its treatment represents a large cost for the National Health Service (about … per patient)

4 Data set description
The dataset contains 565 instances (526 negative + 39 positive). Each entry contains 35 variables, which can be grouped into:
1. Patient risk factors: such as age, sex, laboratory analyses and comorbid conditions (e.g. obesity)
2. Cancer risk factors: such as site and stage of the tumor
3. Treatment risk factors: such as administration of chemotherapy or targeted therapy agents

5 State of the art

6 Terminology
- Classification process: takes an instance as input and tries to predict whether it is positive or negative
- Medical evaluation metrics (such as accuracy, PPV and NPV) are derived from the related confusion matrix
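As a minimal sketch of how the metrics used in the following slides derive from a confusion matrix, the snippet below computes accuracy, positive predictive value (PPV) and negative predictive value (NPV) using their standard definitions. The counts are hypothetical and chosen only to match the dataset size (39 positives, 526 negatives); they are not results from the thesis.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    tp: int  # positive instances predicted positive
    fp: int  # negative instances predicted positive
    tn: int  # negative instances predicted negative
    fn: int  # positive instances predicted negative

    def accuracy(self) -> float:
        """Fraction of all instances classified correctly."""
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total

    def ppv(self) -> float:
        """Fraction of positive predictions that are truly positive."""
        predicted_pos = self.tp + self.fp
        return self.tp / predicted_pos if predicted_pos else float("nan")

    def npv(self) -> float:
        """Fraction of negative predictions that are truly negative."""
        predicted_neg = self.tn + self.fn
        return self.tn / predicted_neg if predicted_neg else float("nan")


# Hypothetical counts for illustration only (39 positives, 526 negatives).
cm = ConfusionMatrix(tp=20, fp=80, tn=446, fn=19)
print(f"accuracy={cm.accuracy():.3f} ppv={cm.ppv():.3f} npv={cm.npv():.3f}")
```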

7 Statistical approach: Khorana's score
This model uses 5 biological variables as predictors and classifies patients into three risk categories: low, intermediate and high risk.
[Table: number of patients in the low, intermediate and high risk categories]
Metrics    Values
Accuracy   53%
PPV        10%
NPV        96%
Pros:
- Simple and clear model
- Low cost of predictive variables
Cons:
- Too many patients classified as intermediate risk
- Poor performance
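The slides do not list the five variables or their cut-offs. The sketch below follows the Khorana score as it is commonly reported in the literature; every threshold, site list and parameter name here is an assumption to be checked against the original publication, not something taken from this presentation.

```python
# Assumed site groupings and cut-offs (commonly reported form of the score).
VERY_HIGH_RISK_SITES = {"stomach", "pancreas"}
HIGH_RISK_SITES = {"lung", "lymphoma", "gynecologic", "bladder", "testicular"}


def khorana_score(site, platelets, hemoglobin, on_esa, leukocytes, bmi):
    """Return the risk score from pre-chemotherapy values.

    platelets and leukocytes in 10^9/L, hemoglobin in g/dL, bmi in kg/m^2;
    on_esa is True if the patient receives red cell growth factors.
    """
    score = 0
    if site in VERY_HIGH_RISK_SITES:
        score += 2
    elif site in HIGH_RISK_SITES:
        score += 1
    if platelets >= 350:
        score += 1
    if hemoglobin < 10 or on_esa:
        score += 1
    if leukocytes > 11:
        score += 1
    if bmi >= 35:
        score += 1
    return score


def risk_category(score):
    """Map a score to the three risk categories named on the slide."""
    if score == 0:
        return "low"
    if score <= 2:
        return "intermediate"
    return "high"


# Hypothetical patient, for illustration only.
example = khorana_score("lung", platelets=360, hemoglobin=12.5,
                        on_esa=False, leukocytes=8.0, bmi=27.0)
print(example, risk_category(example))
```

Because scores of 1 and 2 both map to the intermediate category, a wide range of patients ends up there, which is consistent with the first drawback listed on the slide.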

8 Challenge:
- Is it possible to find better variable combinations able to predict thrombosis through data mining?
- What is the best predictive combination in terms of cost/benefit among all the possible ones?
- Are the screening costs of these combinations sustainable for the National Health Service?

10 Knowledge Discovery in Health Care

11 WEKA
- WEKA: Waikato Environment for Knowledge Analysis
- It is a free tool for data mining applications, written in Java
- It implements all the steps of the KDD workflow, from data preprocessing to the visualization of discovered patterns
- Here the attention is focused on data preprocessing, attribute selection and the learning phase

12 WEKA: learning phase
- Learning phase: training and testing data sets must be disjoint
- An unbalanced data set causes:
  - Excessive influence of the majority class on the classification model
  - High global performance without forecasting a single instance of the minority class
- The balanced training and testing datasets are created manually during the preprocessing phase
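The second effect of imbalance can be checked with simple arithmetic on the class counts given in the data set description (526 negative, 39 positive). The sketch below evaluates a trivial classifier that always predicts "negative"; this baseline is introduced here for illustration and is not one of the models in the thesis.

```python
# Class counts from the data set description.
negatives, positives = 526, 39
total = negatives + positives

# A classifier that labels every instance as negative.
tp, fp = 0, 0
tn, fn = negatives, positives

accuracy = (tp + tn) / total      # share of correct predictions
sensitivity = tp / (tp + fn)      # share of thromboses detected
npv = tn / (tn + fn)              # reliability of a negative prediction

print(f"accuracy={accuracy:.1%} sensitivity={sensitivity:.1%} npv={npv:.1%}")
```

The accuracy is about 93% even though no thrombotic event is detected, which is why accuracy alone is a misleading measure on this dataset.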

14 Data set pre-processing: cleaning (1/3)
Create three balanced folds and combine the partial results:
- All the instances are classified exactly once
- All the training sets have the same number of positive and negative instances
- Training and testing datasets are disjoint
- Extra cost: each experiment requires three runs
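One way to obtain splits with these properties is sketched below. It assumes that balance is reached by undersampling the majority class inside each training set; the slides do not say how the balancing was actually done, and the function name `balanced_runs` is hypothetical.

```python
import random


def balanced_runs(positives, negatives, k=3, seed=0):
    """Build k (train, test) pairs from two lists of instances.

    Each instance appears in exactly one test set. Each training set is
    drawn from the other k-1 folds and undersampled so that both classes
    have the same size (an assumption, see the lead-in).
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    pos_folds = [pos[i::k] for i in range(k)]
    neg_folds = [neg[i::k] for i in range(k)]

    runs = []
    for i in range(k):
        test = pos_folds[i] + neg_folds[i]
        train_pos = [x for j in range(k) if j != i for x in pos_folds[j]]
        train_neg = [x for j in range(k) if j != i for x in neg_folds[j]]
        n = min(len(train_pos), len(train_neg))
        train = rng.sample(train_pos, n) + rng.sample(train_neg, n)
        runs.append((train, test))
    return runs


# Placeholder instances with the class counts of the dataset.
positives = [("pos", i) for i in range(39)]
negatives = [("neg", i) for i in range(526)]
runs = balanced_runs(positives, negatives)
for train, test in runs:
    print(len(train), len(test))
```

With 39 positive instances and three folds, each training set holds 26 positive and 26 negative instances, and the three runs together test every instance once.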

15 Data set pre-processing: cleaning (2/3)
- The objective is to remove noisy instances
- VTE normally occurs within 6 months of the beginning of chemotherapy
- Outliers are given by:
- The time interval is enlarged to 12 months to also cover asymptomatic events
- Intrinsic probability of having a thrombotic event
- Changes in anticancer treatments

16 Data set preprocessing: improvements (3/3)
- Unstructured numerical data are aggregated so that they do not distort the classification model (see figure)
- Instances with missing values are discarded because:
  - Artificial values cannot correspond to real cases
  - They can create problems in both the training and the testing data sets
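The two steps above can be sketched as follows. The sketch reads "aggregated" as grouping numeric values into ranges, which is an interpretation; the attribute names, bin edges and labels are hypothetical and do not come from the thesis.

```python
def drop_incomplete(rows):
    """Keep only the rows in which every attribute has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def bin_value(value, edges, labels):
    """Map a number to a range label.

    edges must be ascending and labels must have one more item than edges:
    values below edges[0] get labels[0], and so on.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


# Hypothetical records for illustration only.
rows = [
    {"age": 45, "platelets": 210},
    {"age": 67, "platelets": None},   # discarded: missing value
    {"age": 72, "platelets": 380},
]
AGE_EDGES = [50, 70]
AGE_LABELS = ["<50", "50-69", ">=70"]

clean = drop_incomplete(rows)
for r in clean:
    r["age_group"] = bin_value(r["age"], AGE_EDGES, AGE_LABELS)
print(clean)
```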

18 Attribute selection (1/2)
- Feature selection returns meaningful subsets of the original attributes, ignoring the ones which provide no information
- Filter methods:
  - they are independent of any learning algorithm and rely only on data properties
  - they can be seen as the combination of search techniques for proposing new subsets and evaluation metrics to rank them
- WEKA provides many search techniques and evaluation metrics

19 Attribute selection (2/2)
- GreedyStepwise: performs a greedy search through the space of attribute subsets in both directions (backward and forward), starting from the empty set
- CfsSubsetEval (correlation-based feature subset evaluation): prefers subsets whose attributes are highly correlated with the class but have low inter-correlation
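The combination of a search technique and a subset evaluator can be illustrated with a small, self-contained sketch. It is not WEKA's implementation: it searches forward only, uses absolute Pearson correlation as the correlation measure (an assumption made for simplicity), and scores a subset of k attributes with the usual correlation-based merit k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean attribute-class correlation and r_ff the mean attribute-attribute correlation. All data and names are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def cfs_merit(subset, columns, target):
    """Merit is high when attributes track the class but not each other."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs)
            / len(pairs)) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward(columns, target):
    """Add the attribute that most improves the merit until none does."""
    selected, best = [], 0.0
    remaining = list(columns)
    while remaining:
        merit, feat = max((cfs_merit(selected + [f], columns, target), f)
                          for f in remaining)
        if merit <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = merit
    return selected, best


# Made-up data: "a" equals the class, "b" is a noisy copy, "c" is unrelated.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
selected, merit = greedy_forward(columns, target)
print(selected, round(merit, 3))
```

In this toy example the search keeps only "a": adding "b" brings redundant information and adding "c" brings none, so both lower the merit.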

21 Classification
Guidelines:
- For each subset found in the previous step, experiments are conducted using different learning algorithms
- PPV, NPV and accuracy are compared; Khorana's results are used as benchmarks
- A constraint is fixed: no NPV values lower than 96% are allowed
- WEKA provides a variety of learning algorithms; the ones used in the experiments are: Bayes algorithms, decision trees, covering rules, logistic regression functions and lazy algorithms
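The selection rule in these guidelines can be sketched as a filter followed by a ranking. Only the benchmark values (accuracy 53%, PPV 10%, NPV 96%) and the 96% threshold come from the slides; the candidate names and their figures are hypothetical placeholders, not experimental results, and ranking by PPV is an assumption about how ties between admissible models would be broken.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float
    ppv: float
    npv: float


# Benchmark values reported for the Khorana score in the slides.
KHORANA = Result("Khorana score", 0.53, 0.10, 0.96)
NPV_MIN = 0.96  # constraint from the guidelines


def admissible(results, npv_min=NPV_MIN):
    """Discard every result whose NPV falls below the constraint."""
    return [r for r in results if r.npv >= npv_min]


def rank_by_ppv(results):
    """Order results from highest to lowest PPV."""
    return sorted(results, key=lambda r: r.ppv, reverse=True)


# Hypothetical candidates, for illustration only.
candidates = [
    Result("candidate A", 0.70, 0.20, 0.97),
    Result("candidate B", 0.80, 0.30, 0.94),  # rejected: NPV below 96%
    Result("candidate C", 0.60, 0.15, 0.96),
]
ranking = rank_by_ppv(admissible(candidates))
for r in ranking:
    print(r.name, r.ppv > KHORANA.ppv, r.accuracy > KHORANA.accuracy)
```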

22 Classification: Accuracy
All the predictive groups have better accuracy than Pure-KS

23 Classification: NPV
The Khorana group violates the NPV constraint: its NPV is below 96%

24 Classification: PPV
The WEKA and ThP groups double the PPV obtained by Pure-KS

26 Cost Evaluation (1/2)
Evaluation of the screening cost and of the possible NHS savings

27 Cost Evaluation (2/2)
- In all cases, the National Health Service saves money from correctly predicted thromboses (no treatment needed) and covers the screening costs at the same time
- Augmented-KS is the best predictive combination from an economic point of view
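The balance described above can be written as a one-line formula. The sketch assumes a deliberately simple model in which each correctly predicted thrombosis avoids the full treatment cost and every patient is screened once; the slides do not give the actual cost model, and all amounts below are placeholders rather than figures from the thesis.

```python
def net_saving(true_positives, treatment_cost, patients_screened, screening_cost):
    """Avoided treatment costs minus total screening costs.

    A positive value means the savings cover the screening programme.
    """
    saved = true_positives * treatment_cost
    spent = patients_screened * screening_cost
    return saved - spent


# Placeholder amounts, for illustration only.
balance = net_saving(true_positives=20, treatment_cost=1000.0,
                     patients_screened=565, screening_cost=10.0)
print(balance, balance > 0)
```

Under this model a predictive combination is sustainable when its screening cost per patient stays below the treatment cost multiplied by the share of screened patients whose thrombosis is correctly predicted.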

29 Conclusion and future work
From the use of data mining for the study of chemotherapy-associated thrombosis:
- PPV increases by 150% with respect to the statistical approach
- The NHS saves money from correctly predicted thromboses and covers the screening costs at the same time
Given the limited size of the dataset analyzed, better results may be reached by:
- repeating the experiments with more biological variables
- repeating the experiments with more instances in the dataset
