Multivariate Data Analysis and Machine Learning in High Energy Physics (V)


Helge Voss (MPI-K, Heidelberg), Graduierten-Kolleg, Freiburg, 11.5-15.5.2009

Outline
- Last lecture (recap)
- Rule Fitting
- Support Vector Machines
- Some toy demonstrations
- General considerations + systematic uncertainties

Last Lecture: (Boosted) Decision Trees
A decision tree is a sequential application of splits (cuts) that carves out an arbitrary volume, formed out of little cubes, in your parameter space; a test event is classified as either signal or background depending on which cube (leaf node) it falls into.
Average over many sets of such decision trees: boosting creates different trees using modifications of the event weights in the training data (AdaBoost, ε-boost, logit-boost), as do bagging and Random Forests.
This is the brute-force method that often proves very effective and robust, although it has hundreds or thousands of free parameters: the simple construction rules leave little chance to do things wrong (falling into local minima, or becoming confused by a high-dimensional problem with little data).
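As a hedged illustration of the boosting and bagging ideas recapped above, the following minimal sketch (not from the slides; the dataset and hyperparameters are purely illustrative) trains an AdaBoost classifier and a Random Forest with scikit-learn and compares them by cross-validated ROC AUC:

```python
# Minimal sketch: boosting and bagging of decision trees on a toy
# "signal vs background" sample; all settings are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy sample with a few discriminating variables.
X, y = make_classification(n_samples=5000, n_features=6, n_informative=4,
                           random_state=0)

# AdaBoost: reweight misclassified training events and average many shallow
# trees (by default scikit-learn boosts depth-1 trees, i.e. decision stumps).
bdt = AdaBoostClassifier(n_estimators=400, learning_rate=0.5, random_state=0)

# Bagging / Random Forest: average trees grown on bootstrap resamples.
forest = RandomForestClassifier(n_estimators=400, random_state=0)

for name, clf in [("AdaBoost (BDT)", bdt), ("Random Forest", forest)]:
    auc = cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
    print(f"{name}: ROC AUC = {auc:.3f}")
```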

Learning with Rule Ensembles
Following the RuleFit approach by Friedman and Popescu (Tech. Rep., Statistics Dept., Stanford University, 2003): the model is a linear combination of rules, where a rule is a sequence of cuts.
RuleFit classifier:
y_{RF}(\hat{x}) = a_0 + \sum_{m=1}^{M_R} a_m\, r_m(\hat{x}) + \sum_{k=1}^{n_r} b_k\, \hat{x}_k
Here \hat{x} are the normalised discriminating event variables; the first sum runs over the rules (each rule is a cut sequence with r_m = 1 if all cuts are satisfied and r_m = 0 otherwise), and the second sum is a linear Fisher-like term.
The problem to solve:
- Create the rule ensemble: use a forest of decision trees.
- Fit the coefficients a_m, b_k by gradient directed regularization, minimising the risk (Friedman et al.).
- Pruning removes topologically equal rules (same variables in the cut sequence).
(Slide illustration: one of the elementary cellular automaton rules (Wolfram 1983, 2002). It specifies the next colour of a cell depending on its colour and its immediate neighbours; its rule outcomes are encoded in the binary representation 30 = 00011110_2.)
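To make the formula concrete, here is a minimal sketch (a hypothetical helper, not the TMVA or Friedman-Popescu implementation) that evaluates a RuleFit-style score y_RF(x) = a_0 + Σ_m a_m r_m(x) + Σ_k b_k x_k for a given set of rules and coefficients; the rules and numbers are invented for illustration:

```python
# Minimal sketch of evaluating a RuleFit-style model: each rule is a cut
# sequence that contributes 1 only if all of its cuts are satisfied.
import numpy as np

def evaluate_rulefit(x, a0, rules, rule_coeffs, linear_coeffs):
    """x: 1D array of normalised input variables.
    rules: list of rules, each a list of (variable_index, low, high) cuts.
    rule_coeffs: coefficients a_m; linear_coeffs: Fisher-like coefficients b_k."""
    y = a0
    for a_m, rule in zip(rule_coeffs, rules):
        r_m = all(low <= x[i] <= high for i, low, high in rule)
        y += a_m * float(r_m)              # rule term: 1 only if every cut passes
    y += float(np.dot(linear_coeffs, x))   # linear (Fisher-like) term
    return y

# Illustrative numbers only: two rules on two normalised variables.
rules = [[(0, 0.2, 1.0)], [(0, -1.0, 0.5), (1, 0.0, 1.0)]]
print(evaluate_rulefit(np.array([0.3, 0.4]), a0=0.1, rules=rules,
                       rule_coeffs=[0.8, -0.4], linear_coeffs=[0.2, 0.1]))
```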

Support Vector Machines
If neural networks are complicated by having to find the optimum weights for best separation power, given the wiggly functional behaviour of the piecewise-defined separating hyperplane;
if k-NN (multidimensional likelihood) suffers from the disadvantage that calculating the MVA output of each test event involves evaluating ALL training events;
and if boosted decision trees are in theory always weaker than a perfect neural network;
then try to get the best of all worlds.

Support Vector Machine
There are methods to create linear decision boundaries using only measures of distances; this leads to a quadratic optimisation process. The decision boundary in the end is defined only by the training events that are closest to the boundary.
We have seen that variable transformations, i.e. moving into a higher-dimensional space (e.g. using var1*var1 in the Fisher discriminant), can allow you to separate non-linear problems with linear decision boundaries. Combining these two ideas gives the Support Vector Machine.
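The following minimal sketch (toy data of my own, not from the lecture) illustrates that second point: a circular signal/background pattern is not linearly separable in the original two variables, but adding their squares as extra inputs lets a linear discriminant separate it:

```python
# Minimal sketch: a circular pattern becomes linearly separable after an
# explicit mapping x -> (x1, x2, x1^2, x2^2); dataset is illustrative.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 1.5).astype(int)   # "signal" outside a circle

lda = LinearDiscriminantAnalysis()
print("linear variables only :",
      cross_val_score(lda, X, y, cv=3, scoring="roc_auc").mean())

X_mapped = np.column_stack([X, X[:, 0]**2, X[:, 1]**2])  # explicit feature map
print("with squared variables:",
      cross_val_score(lda, X_mapped, y, cv=3, scoring="roc_auc").mean())
```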

Support Vector Machines
Find the hyperplane that best separates signal from background. Best separation means maximum distance (margin) between the closest events (the support vectors) and the hyperplane; this gives a linear decision boundary. If the data are non-separable, add a misclassification cost term C Σ_i ξ_i to the minimisation function. The solution of largest margin depends only on inner products of the support vectors (distances) and is found as a quadratic minimisation problem.
(Slide figure: separable vs non-separable data in the (x1, x2) plane, showing the support vectors, the optimal hyperplane, the margin, and the slack variables ξ_1 ... ξ_4.)
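A minimal soft-margin example (illustrative data and C values, not from the slides): a linear SVM is fitted for several values of the cost parameter C, and only the support vectors, the events closest to or violating the margin, determine the hyperplane:

```python
# Minimal sketch: linear soft-margin SVM; C sets the penalty C * sum_i xi_i
# on margin violations. Data and C values are illustrative.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=400, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ counts the support vectors per class.
    print(f"C={C:6.2f}: {svm.n_support_.sum()} support vectors, "
          f"training accuracy {svm.score(X, y):.3f}")
```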

Support Vector Machines: Non-Linear Cases
As before: find the hyperplane with maximum margin between the closest events (support vectors) and the boundary; for non-separable data add the misclassification cost term C Σ_i ξ_i; the solution depends only on inner products of the support vectors and is a quadratic minimisation problem.
Non-linear cases: transform the variables into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data. The explicit transformation x → Φ(x) does not need to be specified; only the scalar product (inner product) Φ(x)·Φ(x') is needed. Certain kernel functions can be interpreted as scalar products between transformed vectors in the higher-dimensional feature space, e.g. Gaussian, polynomial, sigmoid. Choose a kernel and fit the hyperplane using the linear techniques developed above.
The kernel size parameter typically needs careful tuning (overtraining!).
(Slide figure: separable and non-separable data, and the mapping φ(x1, x2) into an additional dimension x3.)
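Since the kernel width needs careful tuning, a common approach is to scan it by cross-validation. This minimal sketch (illustrative grid and toy data, not the lecture's setup) tunes the Gaussian (RBF) kernel width gamma and the cost C:

```python
# Minimal sketch: RBF-kernel SVM with gamma and C tuned by cross-validation,
# which guards against overtraining from a too-narrow kernel.
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=1000, noise=0.15, factor=0.4, random_state=0)

grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100],
                                "gamma": [0.01, 0.1, 1, 10]},
                    cv=3, scoring="roc_auc")
grid.fit(X, y)
print("best parameters:", grid.best_params_,
      "cross-validated ROC AUC:", round(grid.best_score_, 3))
```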

See Some Classifiers at Work

Toy Examples: Linear, Cross, and Circular Correlations
Illustrate the behaviour of linear and non-linear classifiers on three toy patterns:
- linear correlations (same for signal and background)
- linear correlations (opposite for signal and background, the "cross" pattern)
- circular correlations (same for signal and background)
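The toy samples used in the original demonstrations are not reproduced here, but samples similar in spirit can be generated as in this minimal sketch (all parameters are my own illustrative choices):

```python
# Minimal sketch: toy two-variable samples with linear, cross-linear and
# circular correlation patterns between signal and background.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

def correlated_gaussian(mean, corr, size):
    cov = [[1.0, corr], [corr, 1.0]]
    return rng.multivariate_normal(mean, cov, size)

# Linear correlations, same sign for signal and background (shifted means).
sig_lin = correlated_gaussian([0.5, 0.5], +0.8, n)
bkg_lin = correlated_gaussian([-0.5, -0.5], +0.8, n)

# Linear correlations with opposite sign ("cross" pattern).
sig_cross = correlated_gaussian([0.0, 0.0], +0.8, n)
bkg_cross = correlated_gaussian([0.0, 0.0], -0.8, n)

# Circular correlation: signal on a ring, background concentrated in the centre.
phi = rng.uniform(0, 2 * np.pi, n)
r = rng.normal(1.5, 0.2, n)
sig_circ = np.column_stack([r * np.cos(phi), r * np.sin(phi)])
bkg_circ = rng.normal(0.0, 0.6, size=(n, 2))

print(sig_lin.shape, sig_cross.shape, sig_circ.shape)
```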

Weight Variables by Classifier Output
How well do the classifiers resolve the various correlation patterns: linear correlations (same for signal and background), cross-linear correlations (opposite for signal and background), and circular correlations (same for signal and background)?
(Slide figures: input variables weighted by the classifier output for Likelihood, PDERS, Fisher, MLP and BDT-D on each of the three toy patterns.)

General Advice for (MVA) Analyses
There is no magic in MVA methods: no need to be too afraid of black boxes, but you typically still need careful tuning and some hard work.
The most important thing at the start is finding good observables:
- good separation power between signal and background
- little correlation amongst each other
- no correlation with the parameters you try to measure in your signal sample!
Think also about possible combinations of variables: can this eliminate a correlation?
Always apply straightforward preselection cuts and let the MVA do only the rest.
Sharp features should be avoided: they cause numerical problems and loss of information when binning is applied. Simple variable transformations (e.g. log(var1)) can often smooth out these regions and allow signal and background differences to appear more clearly.
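As a hedged illustration of that last point, the sketch below applies a log transform to a sharply peaked variable and a whitening (decorrelation) step before training; the variable names and distributions are invented:

```python
# Minimal sketch: log transform of a sharply peaked variable plus PCA
# whitening to decorrelate the inputs before training.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
var1 = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # sharply peaked at low values
var2 = 0.7 * np.log(var1) + rng.normal(size=5000)      # correlated with log(var1)

X = np.column_stack([np.log(var1), var2])              # smooth out var1 with a log

# PCA with whitening yields decorrelated inputs with unit variance.
X_decorrelated = PCA(whiten=True).fit_transform(X)
print(np.round(np.corrcoef(X_decorrelated, rowvar=False), 3))
```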

Some Words about Systematic Errors
Typical worries are:
- What happens if the estimated probability density is wrong?
- Can the classifier, i.e. the discrimination function y(x), introduce systematic uncertainties?
- What happens if the training data do not match reality?
Any wrong PDF leads to an imperfect discrimination function. An imperfect (calling it "wrong" isn't right) y(x) means a loss of discrimination power, that's all! Classical cuts face exactly the same problem. However, since y(x) approximates the likelihood ratio P(x|S)/P(x|B), in addition to cutting on features that are not modelled correctly you can now also exploit correlations that are in fact not correct.
Systematic errors are only introduced once Monte Carlo events with imperfect modelling are used for the efficiency, purity, or number of expected events; this is the same problem as in a classical cut analysis. Use control samples to test the MVA output distribution y(x).
The combined variable (the MVA output y(x)) might hide problems in ONE individual variable more than if it were looked at alone: train the classifier with only a few variables at a time and compare with data.
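One simple way to use a control sample, sketched below with made-up arrays standing in for real classifier outputs, is to compare the y(x) distribution in simulation and in a data control sample, e.g. with a two-sample KS test:

```python
# Minimal sketch: compare the classifier output distribution between
# simulation and a data control sample; the arrays here are invented.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
y_mc = rng.normal(0.0, 1.0, 2000)        # classifier output on simulated events
y_control = rng.normal(0.1, 1.0, 2000)   # classifier output on a data control sample

stat, p_value = ks_2samp(y_mc, y_control)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
# A small p-value would point to mis-modelling that propagates into the MVA output.
```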

Treatment of Systematic Uncertainties
Is there, however, a strategy to become less sensitive to possible systematic uncertainties? Classically, for a variable that is prone to uncertainties, one would not cut in the region of steepest gradient, and one would not place the most important cut on an uncertain variable.
Try to make the classifier less sensitive to uncertain variables, e.g. re-weight events in the training to decrease the separation in variables with large systematic uncertainty (certainly not yet a recipe that can strictly be followed, more an idea of what could perhaps be done).
A calibration uncertainty may shift the central value and hence worsen (or increase) the discrimination power of var4.
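The sketch below is one possible (hypothetical) realisation of that re-weighting idea, not an established recipe: events whose discrimination relies mostly on an uncertain variable are given less training weight before fitting a gradient-boosted tree classifier with per-event weights:

```python
# Minimal sketch: down-weight signal events in the tails of an uncertain
# variable before training, to reduce the classifier's reliance on it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=5, n_informative=3,
                           random_state=0)
uncertain_var = X[:, 0]                   # variable with a large systematic

# Illustrative weighting: suppress events far out in the uncertain variable.
weights = np.exp(-0.5 * (uncertain_var / uncertain_var.std())**2)
weights = np.where(y == 1, weights, 1.0)  # applied to signal events only

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y, sample_weight=weights)
print("feature importances:", np.round(clf.feature_importances_, 3))
```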

Treatment of Systematic Uncertainties: 1st Way
(Slide figures: classifier output distributions for signal only, with and without the systematic shift.)

Treatment of Systematic Uncertainties: 2nd Way
(Slide figures: classifier output distributions for signal only, with and without the systematic shift.)

Summary of Classifiers and Their Properties
(Slide table: the classifiers Cuts, Likelihood, PDERS / k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit and SVM are rated against the criteria performance (no/linear correlations, nonlinear correlations), speed (training, response), robustness (overtraining, weak input variables), curse of dimensionality, and transparency; the ratings themselves are not reproduced in this transcription.)
The properties of the Function Discriminant (FDA) depend on the chosen function.

Summary
Traditional cuts are certainly not the most effective selection technique. Multivariate analysis combines variables, taking variable correlations into account.
- Optimal event selection is based on the likelihood ratio.
- k-NN and kernel methods are attempts to estimate the probability density in D dimensions.
- Naive Bayes, or projective likelihood, ignores variable correlations; use de-correlation pre-processing of the data if appropriate.
More classifiers:
- Fisher's linear discriminant: simple and robust
- Neural networks: very powerful but difficult to train (multiple local minima)
- Support Vector Machines: one global minimum but need careful tuning
- Boosted Decision Trees: a brute-force method
Avoid overtraining. For systematic errors, be as careful as with cuts and check against data.