Machine Learning and Bioinformatics

1 Molecular Biomedical Informatics Lab, Machine Learning & Bioinformatics

2 Evaluation: the key to success

3 Three datasets for which the answers must be known

4 Note on parameter tuning
- It is important that the testing data is not used in any way to create the classifier
- Some learning schemes operate in two stages
  - build the basic structure
  - optimize parameters
- The testing data cannot be used for parameter tuning
- Proper procedure uses three sets: training, tuning and testing data
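The three-set procedure can be sketched as follows. This is a minimal illustration, not the course's own code; the function name `three_way_split` and the 50/25/25 proportions are assumptions made for the example.

```python
import random

def three_way_split(samples, tune_frac=0.25, test_frac=0.25, seed=0):
    """Shuffle once, then cut into training, tuning and testing sets.

    The fractions are illustrative assumptions, not values from the slides.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_tune = int(len(shuffled) * tune_frac)
    test = shuffled[:n_test]                  # touched only for the final estimate
    tune = shuffled[n_test:n_test + n_tune]   # used to choose parameters
    train = shuffled[n_test + n_tune:]        # used to build the classifier
    return train, tune, test

train, tune, test = three_way_split(range(100))
print(len(train), len(tune), len(test))
```

Parameters would be chosen by comparing candidates on `tune`; `test` is consulted once, after all choices are fixed.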

5 Data is usually limited
- Error on the training data is NOT a good indicator of performance on future data
  - otherwise 1NN would be the optimum classifier
- Not a problem if lots of (answered) data is available
  - split data into training, tuning and testing sets
- However, (answered) data is usually limited
  - more sophisticated techniques need to be used

6 Issues in evaluation
- Statistical reliability of estimated differences in performance
  - significance tests
- Choice of performance measures
  - number of correctly classified samples
  - ratio of correctly classified samples
  - error in numeric predictions
- Costs assigned to different types of errors
  - many practical applications involve costs

7 Training and testing sets
- The testing set must play no part in classifier formation, including parameter tuning
- Ideally, both training and testing sets are representative samples of the underlying problem, but they may differ in nature
  - e.g., we have data from two different towns A and B and want to estimate the performance of our classifier in a completely new town

8 Which set (training vs. tuning/testing) should be more similar to the target new town?

9 Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier for real (unknown) data
- A dilemma
  - generally, the larger the training data the better the classifier (but returns diminish)
  - the larger the testing data the more accurate the error estimate

10 Holdout procedure
- Method of splitting original data into training and testing sets
- Reserve a certain amount for testing and use the remainder for training
  - usually one third for testing and the rest for training
- The samples might not be representative
  - e.g., a class might be missing in the testing data
- Stratification ensures that each class is represented with approximately equal proportions in both subsets
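A stratified holdout split can be sketched as below. The function name `stratified_holdout` and the toy labels are assumptions for illustration; the one-third testing fraction follows the slide.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_frac=1/3, seed=0):
    """Return (train_idx, test_idx), holding out test_frac of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)  # per-class share keeps proportions
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical data: 30 "yes" and 60 "no" samples.
labels = ["yes"] * 30 + ["no"] * 60
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
```

Because the split is made inside each class, neither subset can end up missing a class that has at least a few samples.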

11 Repeated holdout procedure
- Holdout procedure can be made more reliable by repeating the process with different subsamples
  - in each iteration, a certain proportion is randomly selected for testing (possibly with stratification)
  - the error rates on the different iterations are averaged to yield an overall error rate
- This is called the repeated holdout procedure
- A problem is that the different testing sets overlap

12 Cross-validation
- Cross-validation avoids overlapping test sets
  - split data into n subsets of equal size
  - use each subset in turn for testing, the remainder for training
  - the error estimates are averaged to yield an overall error estimate
- Called n-fold cross-validation
- Often the subsets are stratified before the cross-validation is performed
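The n-fold procedure can be sketched as below (without stratification). The names `n_fold_indices` and `cross_validate` are hypothetical, and `error_fn` stands in for "train on these indices, test on those, return the error rate".

```python
import random

def n_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; the test folds never overlap."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

def cross_validate(n_samples, n_folds, error_fn, seed=0):
    """Average the per-fold errors returned by error_fn(train_idx, test_idx)."""
    errors = [error_fn(train, test)
              for train, test in n_fold_indices(n_samples, n_folds, seed)]
    return sum(errors) / len(errors)

splits = list(n_fold_indices(20, 5))
print([len(test) for _, test in splits])
```

Every sample appears in exactly one test fold, which is what distinguishes this from repeated holdout.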

13 More on cross-validation
- Stratified ten-fold cross-validation
- Why ten?
  - extensive experiments have shown that this is the best choice to get an accurate estimate
  - there is also some theoretical evidence for this
- Repeated stratified cross-validation
  - e.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)

14 Leave-One-Out cross-validation
- A particular form of cross-validation
  - set number of folds to number of training instances
- Makes best use of the data
- Involves no random subsampling
- Advantage and disadvantage
- Very computationally expensive

15 LOO-CV and stratification
- Stratification is not possible
  - there is only one instance in the testing set
- An extreme example
  - random dataset split equally into two classes
  - best inducer predicts majority class
  - 50% accuracy on fresh data
  - LOO-CV estimate is 100% error
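The extreme example can be checked with a few lines. This is a small sketch with a made-up dataset of 50 + 50 labels; the majority-class predictor ignores the features entirely, so only the labels are needed.

```python
from collections import Counter

def loo_cv_error_majority(labels):
    """LOO-CV error rate of a classifier that predicts the majority class."""
    errors = 0
    for i, held_out in enumerate(labels):
        rest = labels[:i] + labels[i + 1:]
        # Removing one instance makes the *other* class the majority.
        prediction = Counter(rest).most_common(1)[0][0]
        errors += prediction != held_out
    return errors / len(labels)

labels = ["a"] * 50 + ["b"] * 50
print(loo_cv_error_majority(labels))
```

Holding out one instance always leaves 49 of its class against 50 of the other, so the majority vote is wrong every time.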

16 Cost

17 Counting the cost
- In practice, different types of classification errors often incur different costs
- Examples
  - terrorist profiling, where predicting negative achieves 99.99% accuracy
  - loan decisions
  - oil-slick detection
  - fault diagnosis
  - promotional mailing

18 Confusion matrix

                 Predicted: Yes     Predicted: No
Actual: Yes      True positive      False negative
Actual: No       False positive     True negative

19 Classification with costs
- [Figure: two cost matrices, indexed by actual class and predicted class]
- Error rate is replaced by average cost per prediction

20 Cost-sensitive learning
- A basic idea is to predict the high-cost class only when very confident about the prediction
- Instead of predicting the most likely class, we should make the prediction that minimizes the expected cost
  - expected cost: dot product of class probabilities and the appropriate column in the cost matrix
  - choose the column (class) that minimizes expected cost
- Costs are taken into account at prediction time, not at training time
- Most learning schemes do not perform cost-sensitive learning
  - they generate the same classifier no matter what costs are assigned to the different classes
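The minimum-expected-cost rule can be sketched as below. The cost values and class probabilities are hypothetical, chosen only to show a case where the most likely class is not the cheapest prediction; rows of `cost` are assumed to be actual classes and columns predicted classes.

```python
def min_expected_cost_class(probs, cost):
    """Pick the predicted class (column) with the smallest expected cost.

    probs[i]   : estimated probability that the actual class is i
    cost[i][j] : cost of predicting j when the actual class is i
    """
    n_classes = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    best = min(range(n_classes), key=expected.__getitem__)
    return best, expected

# Hypothetical costs: class 0 = yes, class 1 = no.
# Missing a "yes" costs 10; a false alarm costs 1; correct predictions cost 0.
cost = [[0, 10],
        [1, 0]]
probs = [0.2, 0.8]  # "no" is the most likely class
best, expected = min_expected_cost_class(probs, cost)
print(best, expected)
```

Here "no" is the more probable class, yet predicting "yes" has the lower expected cost, so the classifier commits to "no" only when it is much more confident.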

21 A simple method for cost-sensitive learning

22 Resampling of instances according to costs

23 Measures

24 Lift charts
- In practice, costs are rarely known
- Decisions are usually made by comparing possible scenarios
- E.g., promotional mail to 1,000,000 households
  - mail to all; 0.1% respond (1000)
  - a data mining tool identifies a subset of the 100,000 most promising; 0.4% of these respond (400)
  - another tool identifies a subset of the 400,000 most promising; 0.2% respond (800)
- Which is better? A lift chart allows a visual comparison
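The arithmetic behind the three scenarios can be laid out as below. The sizes and responder counts come from the example; the labels "tool A" and "tool B" and the `lift` helper (subset response rate divided by the overall response rate) are added here for illustration.

```python
# (households mailed, responders), taken from the mailing example.
scenarios = {
    "mail to all": (1_000_000, 1000),
    "tool A, top 100,000": (100_000, 400),
    "tool B, top 400,000": (400_000, 800),
}
total_size, total_responders = scenarios["mail to all"]

def lift(size, responders):
    """Response rate of the subset divided by the overall response rate."""
    return (responders * total_size) / (size * total_responders)

for name, (size, responders) in scenarios.items():
    print(f"{name}: {responders} responders, lift {lift(size, responders):.1f}")
```

Tool A concentrates responders more strongly (higher lift) while tool B reaches more responders in total, which is why the choice depends on mailing cost versus value per response.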

25 Generating a lift chart
- Sort instances according to predicted probability of being positive
  - [Table: predicted probability and actual class of the sorted instances; the actual classes read Yes, Yes, No, Yes]
- x-axis is sample size; y-axis is number of true positives
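Generating the points of a lift chart can be sketched as below. The probabilities are hypothetical placeholders; only the Yes, Yes, No, Yes order of the actual classes is taken from the slide.

```python
def lift_chart_points(scored):
    """scored: list of (predicted_probability, is_positive) pairs.

    Returns (sample size, number of true positives) for each prefix of the
    instances sorted by decreasing predicted probability.
    """
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    points, tp = [], 0
    for size, (_, is_positive) in enumerate(ranked, start=1):
        tp += int(is_positive)
        points.append((size, tp))
    return points

# Hypothetical probabilities, deliberately given out of order.
scored = [(0.90, False), (0.95, True), (0.88, True), (0.93, True)]
points = lift_chart_points(scored)
print(points)
```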

26 A hypothetical lift chart

27 ROC curves
- ROC curves are similar to lift charts
  - ROC stands for receiver operating characteristic
  - used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
- Differences to lift chart
  - y-axis shows percentage of true positives in sample rather than absolute number
  - x-axis shows percentage of false positives in sample rather than sample size
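A minimal sketch of how ROC points can be computed from ranked predictions, with the area under the curve obtained by the trapezoid rule. It assumes TP rate = TP/(TP+FN) and FP rate = FP/(FP+TN), that both classes are present, and that no two instances share a score; the data are made up.

```python
def roc_points(scored):
    """scored: list of (predicted_probability, is_positive) pairs.

    Returns (FP rate, TP rate) points, starting at (0, 0).
    """
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scored test set: two positives, two negatives.
scored = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
points = roc_points(scored)
print(points, auc(points))
```

With a single small test set the curve is a staircase, which matches the "jagged curve" one gets from one set of test data.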

28 A sample ROC curve
- Jagged curve: one set of test data
- Smooth curve: use cross-validation

29 More measures
- $\text{Precision} = \frac{TP}{TP+FP}$, percentage of reported samples that are positive
- $\text{Recall} = \frac{TP}{TP+FN}$, percentage of positive samples that are reported
- Precision/recall curves have hyperbolic shape
- Three-point average is the average precision at 20%, 50% and 80% recall
- $\text{F-measure} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}}$, harmonic mean of precision and recall
  - makes precision and recall as equal as possible
- $\text{Specificity} = \frac{TN}{FP+TN}$, percentage of negative samples that are not reported
- Area under the ROC curve (AUC)
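These measures follow directly from the four confusion-matrix counts. The sketch below uses made-up counts and assumes none of the denominators is zero.

```python
def classification_measures(tp, fp, tn, fn):
    """Precision, recall, specificity and F-measure from confusion counts."""
    precision = tp / (tp + fp)      # reported samples that are positive
    recall = tp / (tp + fn)         # positive samples that are reported
    specificity = tn / (fp + tn)    # negative samples that are not reported
    f_measure = 2 * recall * precision / (recall + precision)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
    }

# Hypothetical counts for illustration.
m = classification_measures(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```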

30 Summary of some measures

                        Domain                 Plot                   Explanation
Lift chart              Marketing              TP vs. subset size     TP; (TP+FP)/(TP+FP+TN+FN)
ROC curve               Communications         TP rate vs. FP rate    TP/(TP+FN); FP/(FP+TN)
Recall-precision curve  Information retrieval  Recall vs. precision   TP/(TP+FN); TP/(TP+FP)

31 Evaluating numeric prediction
- Same strategies, including independent testing sets, cross-validation, significance tests, etc.

32 Measures in numeric prediction
- Actual target values: $a_1, a_2, \ldots, a_n$
- Predicted target values: $p_1, p_2, \ldots, p_n$
- The most popular measure is mean squared error (MSE), because it is easy to manipulate mathematically
  $\text{MSE} = \frac{(p_1-a_1)^2 + (p_2-a_2)^2 + \cdots + (p_n-a_n)^2}{n}$

33 Other measures
- Root mean squared error (RMSE)
  $\text{RMSE} = \sqrt{\frac{(p_1-a_1)^2 + (p_2-a_2)^2 + \cdots + (p_n-a_n)^2}{n}}$
- Mean absolute error (MAE) is less sensitive to outliers than MSE
  $\text{MAE} = \frac{|p_1-a_1| + |p_2-a_2| + \cdots + |p_n-a_n|}{n}$
- Sometimes relative error values are more appropriate
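MSE, RMSE and MAE can be computed as below. The toy values are made up so that one prediction is far off, which illustrates why MAE reacts less strongly to outliers than MSE.

```python
import math

def mse(predicted, actual):
    """Mean squared error."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error."""
    return math.sqrt(mse(predicted, actual))

def mae(predicted, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical data: three perfect predictions and one outlier (8 vs. 4).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]
print(mse(predicted, actual), rmse(predicted, actual), mae(predicted, actual))
```

The single error of 4 contributes 16 to the squared sum but only 4 to the absolute sum.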

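34 Improvement on the mean
- How much does the scheme improve on simply predicting the average?
- Relative squared error
  $\frac{(p_1-a_1)^2 + (p_2-a_2)^2 + \cdots + (p_n-a_n)^2}{(\bar{a}-a_1)^2 + (\bar{a}-a_2)^2 + \cdots + (\bar{a}-a_n)^2}$
- Relative absolute error
  $\frac{|p_1-a_1| + |p_2-a_2| + \cdots + |p_n-a_n|}{|\bar{a}-a_1| + |\bar{a}-a_2| + \cdots + |\bar{a}-a_n|}$
- Here $\bar{a}$ is the average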
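A sketch of the two relative measures. It assumes the average is taken over the actual values passed in (the slide does not say which data the average comes from), and the toy values are made up.

```python
def relative_squared_error(predicted, actual):
    """Squared error relative to always predicting the average."""
    mean_a = sum(actual) / len(actual)  # assumption: average of these actual values
    num = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    den = sum((mean_a - a) ** 2 for a in actual)
    return num / den

def relative_absolute_error(predicted, actual):
    """Absolute error relative to always predicting the average."""
    mean_a = sum(actual) / len(actual)  # assumption: average of these actual values
    num = sum(abs(p - a) for p, a in zip(predicted, actual))
    den = sum(abs(mean_a - a) for a in actual)
    return num / den

# Hypothetical data for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
print(relative_squared_error(predicted, actual),
      relative_absolute_error(predicted, actual))
```

Values below 1 (below 100%) mean the scheme does better than simply predicting the average; values above 1 mean it does worse.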

35 Correlation coefficient
- Measures the statistical correlation between the predicted values and the actual values
  $\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$
- Scale independent, between $-1$ and $+1$
- Good performance leads to large values
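The coefficient can be computed directly from the definition, as in the sketch below. It uses population (divide-by-n) covariance and standard deviations and assumes neither series is constant; the toy values are made up to show scale independence.

```python
import math

def correlation(predicted, actual):
    """Pearson correlation between predicted and actual values."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a)
              for p, a in zip(predicted, actual)) / n
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    return cov / (sd_p * sd_a)

# Hypothetical data: predictions are 2 * actual + 1, so every one is off.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [3.0, 5.0, 7.0, 9.0]
print(correlation(predicted, actual))
```

Because the measure is scale independent, predictions that are consistently too large can still correlate perfectly with the actual values, which is one reason to look at the error measures as well.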

37 Which measure?
- Best to look at all of them
- Often it doesn't matter

                              A        B        C        D
Root mean-squared error       …        …        …        …
Mean absolute error           …        …        …        …
Root relative squared error   42.2%    57.2%    39.4%    35.8%
Relative absolute error       43.1%    40.1%    34.8%    30.4%
Correlation coefficient       …        …        …        …

- D the best; C the second-best; A and B are arguable

38 Today's exercise

39 Parameter tuning
- Design your own select, feature, buy and sell programs
- Upload and test them in our simulation system
- Finally, commit your best version and send TA Jang a report before 23:59 11/5 (Mon)

40 Possible ways
- Enlarge the parameter range in CV
  - stratified, repeated: minimize the variance
- Make a tuning set
  - use a large training set; make the tuning set as similar to the target stocks as possible
- Cost matrix
  - resampling, otherwise it would be very difficult
- Change measures or plot ROC curves to understand your classifiers
- The best measure is the transaction profit, but it requires the simulation system. Instead, you can develop a compromise evaluation script, which is more complicated than any theoretical measure but simpler than the real problem. This is usually required in practice.