What are anomalies and why do we care?


1 Anomaly Detection Based on V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, ACM Computing Surveys, 41 (2009), Article 15, 58 pages.

2 Outline
- What are anomalies and why do we care?
- Different aspects of anomaly detection
- Anomaly detection applications
- Classification based anomaly detection methods
- Nearest neighbor based anomaly detection methods
- Clustering based anomaly detection methods
- Statistical anomaly detection methods
- Information theoretic methods
- Spectral anomaly detection methods
- Handling contextual anomalies

3 What are anomalies and why do we care?
Anomaly (outlier, discordant observation, exception, aberration, peculiarity, or contaminant) detection refers to finding patterns in data that do not conform to expected behavior. It was first studied in statistics in an 1887 paper by Edgeworth. It is important because anomalies in data translate into significant, actionable information in applications. It is useful in cyber-security, fraud and intrusion detection, health care, sensor or equipment failure detection, and surveillance. The relevance or interestingness of anomalies is a defining feature. It is related to noise accommodation/removal and to novelty detection.

4 Major challenges to making detections:
- Defining a normal region that covers every possible normal behavior pattern is difficult.
- Anomalies due to malicious behavior adapt over time, as adversaries work to make their actions appear normal.
- Normal behavior itself evolves over time.
- Different application domains have different notions of what an anomaly is, making a general theory difficult to establish.
- Labeled data for training and validating models is generally scarce.
- Noise resembles anomalies and leads to false detections.

5 Diversity of techniques:
- Statistics
- Machine learning
- Information theory
- Spectral theory
- Graph theory and topology
These are used in different application areas and may be specific to a single one of them (i.e., not useful for developing an abstract theory).

6 Different aspects of anomaly detection
A key aspect of any anomaly detection technique is the nature of the input data. Input is generally a collection of data instances (also called objects, records, points, vectors, patterns, events, cases, samples, observations, or entities). Each data instance consists of one (univariate) or multiple (multivariate) attributes. The attributes can be of different types, such as binary, categorical, or continuous; mixed attribute types are common in the multivariate case.

7 Attributes are key to choosing which technique(s) to use in anomaly detection. Data instances can also be related to one another (e.g., sequence data, spatial data, and graph data). In sequence data, data instances are linearly ordered. In spatial data, each data instance is related to its neighboring instances; similarly with spatio-temporal data. In graph data, data instances are related through vertices and edges.

8 Anomaly classification:
1. Point anomalies: an individual data instance is anomalous with respect to the rest of the data, with each instance considered independently of the others.
2. Contextual anomalies: a data instance is anomalous only within a specific context, defined by two kinds of attributes.
a. Contextual attributes determine the context, e.g., latitude and longitude.
b. Behavioral attributes determine the non-contextual characteristics, e.g., the average rainfall at a given point on the planet.
3. Collective anomalies: a collection of related data instances is anomalous as a group even when the individual instances are not. An example is an event sequence such as ssh, buffer-overflow, http-web, http-web, ssh, buffer-overflow, ftp, http-web, ftp, ...; a subsequence of such events occurring together can be anomalous even though each event is common on its own.

9 Data labels are associated with data instances and can indicate, among other things, whether an instance is normal or anomalous. Obtaining accurately labeled data is extremely expensive, which limits the training sets that can be built. Anomaly detection operates with labeled data in one of three modes:
- Supervised mode: a training set with both normal and anomalous labeled data is available.
- Semi-supervised mode: a training set with only normal labeled data is available.
- Unsupervised mode: no training set is required; it is assumed that normal data is far more common than anomalies.

10 Anomaly detection output takes one of only two forms:
1. Scores: each test instance receives a numeric anomaly score. Analysts usually examine only the highest-scored instances to verify anomalies; domain-specific scoring thresholds are common and useful.
2. Labels: each test instance is assigned a label of normal or anomaly.
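To make the two output forms concrete, here is a minimal sketch (in Python, with invented scores and an invented threshold) of converting detector scores into labels via a domain-specific threshold:

```python
import numpy as np

# Hypothetical anomaly scores from some detector (higher = more anomalous).
scores = np.array([0.1, 0.3, 4.2, 0.2, 5.7, 0.4])

# A domain-specific threshold turns scores into normal/anomaly labels.
threshold = 3.0
labels = np.where(scores > threshold, "anomaly", "normal")
print(labels)  # ['normal' 'normal' 'anomaly' 'normal' 'anomaly' 'normal']
```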

11 Anomaly detection applications
We consider a number of applications:
- Intrusion detection
- Fraud detection
- Medical and public health anomaly detection
- Industrial damage detection
- Image processing
- Text processing
- Sensor networks
There are many more application fields, each with specialized methodologies.

12 Intrusion detection
Intrusion detection refers to detecting malicious activity (break-ins, hacking attempts, penetrations, and other forms of computer abuse) on computer systems. Key challenges:
- Huge volumes of data, so the anomaly detection techniques must be computationally efficient.
- Data is typically streamed and requires online analysis.
- The false alarm rate can be too high.

13 Host based intrusion detection systems must handle the sequential nature of their data; point anomaly detection techniques are not applicable in this domain. The techniques have to either model the sequence data or compute similarity between sequences. Network based intrusion detection systems deal with network data. The intrusions typically occur as point anomalies, though certain techniques model the data sequentially and detect collective anomalies. The big challenge here is that the anomalies evolve over time as intruders refine their techniques.

14 Methods used in intrusion detection:
- Statistical profiling using histograms
- Parametric or nonparametric statistical modeling
- Bayesian networks
- Mixture of models
- Neural networks
- Support vector machines
- Rule-based systems
- Clustering based
- Nearest neighbor based
- Spectral
- Information theoretic

15 Fraud detection
Fraud detection refers to detecting criminal activity in commercial organizations such as banks, credit card companies, insurance agencies, cell phone companies, and stock markets; the perpetrator may be an employee or a customer. Immediate detection is wanted. Credit card fraud: the data is multidimensional (user id, amount spent, frequency of use, location, distance from last location, time since last use, history of items purchased in the past, ...).

16 Point anomaly techniques are typically used, profiling either by owner or by operation. Detection is wanted during the very first fraudulent transaction. At the same time, the cardholder should not be irritated with false alarms that freeze a card (this can be really irritating when overseas during a hotel checkout or conference registration). Methods used in credit card fraud detection:
- Neural networks
- Rule based systems
- Clustering based

17 Mobile phone fraud detection
The task is to scan a large set of accounts, examining the calling behavior of each, and to issue an alarm when an account appears to have been misused. Methods used in mobile phone fraud detection:
- Statistical profiling using histograms
- Parametric statistical modeling
- Neural networks
- Rule based systems
Insurance claim fraud detection is handled similarly.

18 Insider trading detection
Insider trading refers to trading based on inside knowledge, such as a pending merger or acquisition, a terrorist attack affecting a particular industry, pending legislation affecting a particular industry, or any other information that would move stock prices in a particular industry. It can be detected by identifying anomalous trading activity in the regular and options markets and in tax declarations. Methods used in insider trading detection:
- Statistical profiling using histograms
- Information theoretic

19 Medical and public health anomaly detection
This works with patient records. The data can contain anomalies for several reasons, e.g., an abnormal patient condition, instrumentation errors, or recording errors. Several techniques have also focused on detecting disease outbreaks in a specific area. Methods used in medical and public health anomaly detection:
- Parametric statistical modeling
- Neural networks
- Rule based systems

20
- Bayesian networks
- Nearest neighbor based

Industrial damage detection
Industrial units suffer damage due to continuous usage and normal wear and tear. Such damage needs to be detected early to prevent further escalation and losses. The data in this domain is usually referred to as sensor data, because it is recorded using different sensors and collected for analysis. There are two categories:

21
1. Fault Detection in Mechanical Units
2. Structural Defect Detection
Methods used in industrial damage detection:
- Statistical profiling using histograms
- Parametric or nonparametric statistical modeling
- Bayesian networks
- Mixture of models
- Neural networks
- Rule-based systems
- Spectral

22 Image processing
Look for interesting features in, e.g., video surveillance, satellite imagery, and x-rays/CT scans. Methods used in image processing:
- Bayesian networks
- Mixture of models
- Regression
- Neural networks
- Support vector machines
- Clustering based

23
- Nearest neighbor based

Text data anomaly detection
Detect novel topics, events, or news stories in a collection of documents or news articles. The anomalies are caused by a new interesting event or an anomalous topic. The data in this domain is typically high dimensional and very sparse. The data also has a temporal aspect, since the documents are collected over time.

24 Methods used in text data anomaly detection:
- Statistical profiling using histograms
- Mixture of models
- Neural networks
- Support vector machines
- Clustering based

Sensor networks
Anomalies in data collected from a sensor network can mean either that one or more sensors are faulty or that they are detecting events (e.g., intrusions).

25 Anomaly detection in sensor networks can target sensor fault detection, intrusion detection, or both. By definition, the data is collected online and in a distributed manner, so distributed data mining techniques are used. Methods used in sensor networks:
- Parametric statistical modeling
- Bayesian networks
- Nearest neighbor based
- Rule-based systems
- Spectral

26 Classification based anomaly detection methods
Classification based anomaly detection techniques operate in a two-phase fashion. The training phase learns a classifier from the available labeled training data. The testing phase classifies each test instance as normal or anomalous using the learned classifier. Assumption: a classifier that can distinguish between normal and anomalous classes can be learned in the given feature space.

27 One-class classification based anomaly detection techniques assume that all training instances have a single class label (normal). Multi-class classification based anomaly detection techniques assume that the training data contains labeled instances belonging to multiple normal classes. Each test instance is evaluated against each normal class at a given confidence level; a test instance is declared anomalous if no class accepts it as normal.

28 Neural networks based
A basic multi-class anomaly detection technique using neural networks operates in two steps:
1. A neural network is trained on the normal training data to learn the different normal classes.
2. Each test instance is provided as an input to the neural network. If the network accepts the test input, the instance is normal; if it rejects it, the instance is an anomaly.
Replicator neural networks, which are trained to reproduce their input at the output layer, have been used for one-class anomaly detection.
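A replicator-style scheme can be sketched (illustratively, not as the exact method from the literature) with a small multi-output regressor trained to reproduce its input through a narrow hidden layer; a large reconstruction error means the network "rejects" the instance. The data here is synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 5))   # synthetic normal-only training data

# Train the network to reproduce its input through a narrow hidden layer.
net = MLPRegressor(hidden_layer_sizes=(3,), max_iter=3000, random_state=0)
net.fit(X_normal, X_normal)

def score(X):
    # Reconstruction error serves as the anomaly score.
    return np.mean((X - net.predict(X)) ** 2, axis=1)

X_test = np.vstack([rng.normal(0.0, 1.0, (5, 5)),   # normal-looking instances
                    rng.normal(8.0, 1.0, (2, 5))])  # planted anomalies
print(score(X_test))  # the last two scores should be much larger
```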

29 Some neural net classification methods:
- Multi-layered perceptrons
- Neural trees
- Auto-associative networks
- Adaptive resonance theory based
- Radial basis function based
- Hopfield networks
- Oscillatory networks

30 Bayesian networks based
This is a basic technique for univariate categorical data sets; the different attributes are assumed independent. Given a test data instance, it uses a naïve Bayesian network to estimate the posterior probability of each class label, drawn from a set of normal class labels plus the anomaly class label. The class label with the largest posterior is chosen as the predicted class for the given test instance. The likelihood of observing the test instance given a class, and the priors on the class probabilities, are estimated from the training data set.
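A minimal sketch of the idea on a toy categorical data set (the protocol/service values are invented, and the Laplace-style smoothing is a common convenience, not something prescribed above):

```python
from collections import Counter

# Toy training data: two categorical attributes per instance, plus class labels.
X = [("tcp", "http"), ("tcp", "http"), ("udp", "dns"), ("tcp", "ftp")]
y = ["normal", "normal", "normal", "anomaly"]

classes = sorted(set(y))
vocab = [set(x[j] for x in X) for j in range(len(X[0]))]
counts = {c: [Counter() for _ in X[0]] for c in classes}
for xi, yi in zip(X, y):
    for j, v in enumerate(xi):
        counts[yi][j][v] += 1

def posteriors(x):
    scores = {}
    for c in classes:
        n_c = y.count(c)
        p = n_c / len(y)                       # class prior
        for j, v in enumerate(x):
            # Attribute independence assumption, with Laplace smoothing.
            p *= (counts[c][j][v] + 1) / (n_c + len(vocab[j]))
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# The predicted class is the one with the largest posterior.
print(posteriors(("udp", "http")))
```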

31 Support vector machines based
This method is applied in a one-class setting and learns a region (a boundary) that contains the training data instances. Kernels, such as the radial basis function kernel, can be used to learn complex regions. For each test instance, the basic technique determines whether the instance falls within the learned region: if it does, it is declared normal; otherwise it is an anomaly. Audio anomaly detection is one of the major uses of this method.
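A minimal sketch with scikit-learn's OneClassSVM on synthetic data (the parameter values are illustrative, not tuned):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))   # assumed normal-only training data

# An RBF kernel lets the learned region have a complex boundary.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_test = np.array([[0.1, -0.2],    # inside the learned region
                   [6.0, 6.0]])    # far outside it
print(clf.predict(X_test))         # +1 = normal, -1 = anomaly
```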

32 Rule based
This method learns rules that capture the normal behavior of a system; a test instance that is not covered by any such rule is considered an anomaly. Rule based techniques have been applied in both one-class and multi-class settings. A basic multi-class rule-based technique consists of two steps:
1. Learn rules (each with a confidence level) from the training data using a rule learning algorithm.

33 2. For each test instance, find the rule that best captures it.
Complexity: depends on which classification algorithm is used; decision trees, for example, are faster than SVMs.
+/- of classification based methods:
+ Classification based techniques (especially the multi-class techniques) can use powerful algorithms that distinguish between instances belonging to different classes.

34 + The testing phase is fast, since each test instance only needs to be compared against a precomputed model.
- Multi-class classification based methods rely on the availability of accurate labels for the various normal classes, which is often not possible.
- Classification based methods assign a label to each test instance, which becomes a disadvantage when a meaningful anomaly score is desired.

35 Nearest neighbor based anomaly detection methods
Assumption: normal data instances occur in dense neighborhoods, while anomalies occur far from their closest neighbors. Nearest neighbor based anomaly detection methods can be broadly grouped into two categories:
1. Methods that use the distance of a data instance to its kth nearest neighbor as the anomaly score.
2. Methods that compute the relative density of each data instance to obtain its anomaly score.

36 Using distance to the kth nearest neighbor
The anomaly score of a data instance is defined as its distance to its kth nearest neighbor in the given data set. Three extensions:
1. Modify the definition used to obtain the anomaly score of a data instance.
2. Use different distance/similarity measures to handle different data types.
3. Improve the efficiency of the basic technique, whose complexity is O(N^2) where N is the data size, by pruning the search space: either ignore instances that cannot be anomalous or focus on the instances most likely to be anomalous.
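A minimal sketch of the basic O(N^2) technique, using scikit-learn's neighbor search on synthetic data with one planted anomaly:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)),
               [[7.0, 7.0]]])                 # one planted anomaly at index 300

k = 5
# Ask for k+1 neighbors: each point is its own nearest neighbor at distance 0.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)

scores = dists[:, k]                          # distance to the kth nearest neighbor
print(np.argmax(scores))                      # 300: the planted anomaly
```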

37 Using relative density
Density based anomaly detection techniques estimate the density of the neighborhood of each data instance: low density implies an anomaly; high density implies normal. Plain density based techniques perform poorly if the data has regions of varying densities.

38 Approaches that compare the density around an instance with the density around its neighbors (relative density, as in the Local Outlier Factor) have been developed to handle varying densities.
Complexity: O(N^2)
+/- of nearest neighbor based methods:
+ They are unsupervised in nature and make no assumptions about the generative distribution of the data; they are purely data driven.
+ Semi-supervised techniques perform better than unsupervised ones in terms of missed anomalies, since the likelihood that an anomaly will form a close neighborhood in the training data set is very low.

39 + Adapting these methods to a different data type is easy: just modify the distance measure.
- Missed anomalies for unsupervised methods: if the data has normal instances without enough close neighbors, or anomalies with enough close neighbors, the technique fails to label them correctly.
- Many false positives for semi-supervised methods: if the normal instances in the test data do not have enough similar normal instances in the training data.

40 - The computational complexity of the testing phase is a significant challenge, since it involves computing the distance of each test instance to every other instance.
- The performance of a nearest neighbor based technique relies heavily on the chosen distance measure.
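For the relative density category discussed above, the Local Outlier Factor is the canonical example; a minimal sketch on synthetic data with two clusters of very different densities:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (100, 2)),   # tight, dense cluster
               rng.normal(5.0, 1.0, (100, 2)),   # loose, sparse cluster
               [[2.5, 2.5]]])                    # planted anomaly between them

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                      # -1 = anomaly, +1 = normal
scores = -lof.negative_outlier_factor_           # higher = more anomalous
print(labels[-1], scores[-1])
```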

41 Clustering based anomaly detection methods
Clustering groups similar data instances into clusters. Clustering is primarily an unsupervised technique, though semi-supervised clustering has also been explored lately. Three formulations based on different assumptions:
1. Normal data instances belong to a cluster in the data, while anomalies do not belong to any cluster.
2. Normal data instances lie close to their closest cluster centroid, while anomalies are far from their closest cluster centroid.

42 3. Normal data instances belong to large and dense clusters, while anomalies belong to small or sparse clusters.
Assumption 1 remarks: apply a known clustering algorithm to the data set and declare any data instance that does not belong to any cluster anomalous. A disadvantage of such techniques is that they are not optimized to find anomalies, since the main aim of the underlying clustering algorithm is to find clusters.
Assumption 2 remarks:

43 Methods consist of two steps: (1) the data is clustered using a clustering algorithm, and (2) for each data instance, its distance to its closest cluster centroid is calculated as its anomaly score. These methods can also operate in semi-supervised mode. If the anomalies in the data form clusters by themselves, these techniques will not be able to detect them.
Assumption 3 remarks: methods declare instances belonging to clusters whose size or density is below a threshold to be anomalous.
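A minimal sketch of the two-step centroid-distance formulation (assumption 2) above, using k-means on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(10.0, 1.0, (200, 2))])

# Step 1: cluster the data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: anomaly score = distance to the closest cluster centroid.
scores = km.transform(X).min(axis=1)    # transform() gives distances to all centroids

x_new = np.array([[5.0, 5.0]])          # far from both centroids
print(km.transform(x_new).min(axis=1))  # large score -> likely anomaly
```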

44 There are linear time algorithms. There are many similarities between clustering and nearest neighbor based anomaly detection methods.
Complexity: depends on the training and detection algorithms, but there are a few O(N) ones.
+/- of clustering based methods:
+ Unsupervised mode is viable.
+ Complex data types are handled by choosing a clustering algorithm that can handle the particular data type.

45 + The testing phase is fast, since the number of clusters against which every test instance needs to be compared is a small constant.
- Performance is highly dependent on how effectively the clustering algorithm captures the cluster structure of the normal instances.
- Many techniques detect anomalies as a byproduct of clustering, and hence are not optimized for anomaly detection.
- Missed anomalies: some clustering algorithms force every instance into some cluster, and anomalies that form their own clusters are missed.
- Slow: the clustering step itself can be costly, e.g., O(dN) per pass where d is the data dimensionality.

46 Statistical anomaly detection methods
Underlying principle: an anomaly is an observation that is suspected of being partially or wholly irrelevant because it is not generated by the assumed stochastic model.
Assumption: normal data instances occur in high probability regions of a stochastic model, while anomalies occur in low probability regions of the stochastic model.
Statistical techniques fit a statistical model (usually of normal behavior) to the given data and then apply a statistical inference test to determine whether an unseen instance belongs to this model.

47 Parametric methods
Parametric methods assume that the normal data is generated by a parametric distribution with parameters θ and probability density function f(x, θ), where x is an observation. The anomaly score of a test instance (or observation) x is the inverse of the probability density f(x, θ). The parameters θ are estimated from the given data.

48 Gaussian model based
Such methods assume that the data is generated from a Gaussian distribution whose parameters are estimated using maximum likelihood estimation (MLE). The distance of a data instance from the estimated mean is the anomaly score for that instance, and a threshold is applied to the anomaly scores to determine the anomalies. Different techniques in this category calculate the distance to the mean and the threshold in different ways. Statistical rules commonly used:

49
- Box plot rule
- Grubbs' test
- Student's t-test
- χ² test

Regression model based
Two steps:
1. A regression model is fitted to the data.
2. For each test instance, the residual of the instance is used to determine the anomaly score.

50 The residual is the part of the instance left unexplained by the regression model. The magnitude of the residual can be used as the anomaly score for the test instance, though statistical tests have also been proposed to determine anomalies with a stated confidence. One variant detects anomalies in multivariate time-series data generated by an autoregressive moving average (ARMA) model.
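A minimal sketch of the two-step regression formulation on a synthetic time series, scoring by residual magnitude and thresholding at three standard deviations (the threshold choice is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)
y = 2.0 * t + rng.normal(0.0, 1.0, 100)
y[60] += 15.0                               # planted anomaly

# Step 1: fit a regression model (ordinary least squares line).
slope, intercept = np.polyfit(t, y, deg=1)

# Step 2: the residual magnitude is the anomaly score.
residuals = y - (slope * t + intercept)
scores = np.abs(residuals)

print(np.where(scores > 3 * residuals.std())[0])  # [60]
```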

51 Mixture of parametric distributions based
Such methods use a mixture of parametric statistical distributions to model the data. Two subcategories:
1. Those that model the normal instances and the anomalies as separate parametric distributions.
2. Those that model only the normal instances as a mixture of parametric distributions.
Subcategory remarks:
1. The testing phase involves determining which distribution, normal or anomalous, the test instance belongs to.

52 2. The normal instances are modeled as a mixture of parametric distributions; a test instance that does not belong to any of the learned models is declared an anomaly.
Nonparametric methods
These methods use nonparametric statistical models, in which the model structure is not defined a priori but is instead determined dynamically from the data. They make fewer assumptions about the data (e.g., about smoothness of the density) than parametric techniques.

53 Histogram based
The simplest nonparametric statistical method is to use histograms to maintain a profile of the normal data. The bin size used when building the histogram is key for anomaly detection:
- Too small: many normal test instances fall in empty or rare bins, resulting in a high false alarm rate.
- Too large: many anomalous test instances fall in frequent bins, resulting in a high false negative rate.

54 For univariate data there are two steps:
1. Build a histogram based on the different values taken by the feature in the training data.
2. Check whether a test instance falls in any one of the bins of the histogram. If it does, the test instance is normal; otherwise it is anomalous.
For multivariate data, a basic technique is to construct attribute-wise histograms. During testing, the anomaly score for each attribute value of a test instance is the height of the bin containing that value; the per-attribute anomaly scores are then aggregated to obtain an overall anomaly score for the test instance.
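A minimal sketch of the univariate two-step procedure (the bin count of 30 is an arbitrary illustrative choice; as noted above, bin size is the critical parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 1000)          # normal training data, one attribute

# Step 1: build a histogram of the training values.
counts, edges = np.histogram(train, bins=30)

def is_anomalous(x):
    # Step 2: a value outside the histogram range, or in an empty bin, is anomalous.
    i = np.searchsorted(edges, x, side="right") - 1
    if i < 0 or i >= len(counts):
        return True
    return counts[i] == 0

print(is_anomalous(0.2), is_anomalous(8.0))  # False True
```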

55 Complexity: completely dependent on the method(s) used. Good luck.
+/- of statistical based methods:
+ If the distributional assumptions hold, the resulting statistics are justified.
+ Confidence levels provide, well, confidence.
+ Unsupervised mode works if the distribution estimation step is robust to anomalies in the data.
- These methods rely on the assumption that the data is generated from a particular distribution, which often does not hold, especially for high dimensional real data sets.

56 What was that famous quote about statistics? Histograms are simple to implement and can just as easily lie about the results. You get what you pay for.

57 Information theoretic methods
These analyze the information content of a data set using information theoretic measures such as Kolmogorov complexity, entropy, and relative entropy.
Assumption: anomalies in data induce irregularities in the information content of the data set.
Let C(D) denote the complexity of a given data set D. A basic information theoretic technique can be described as follows: given a data set D, find the minimal subset of instances I such that C(D) - C(D - I) is maximized. All instances in the subset so obtained are deemed anomalous (a small entropy-based sketch appears after the pros and cons below).

58 The problem addressed by this basic technique is to find a Pareto-optimal solution: there is no single optimum, since two different objectives need to be optimized simultaneously.
Complexity: exponential in time. Never, ever use it unless you have no other choice.
+/- of information theoretic based methods:
+ Unsupervised mode works like a charm.
+ No assumptions are made about the underlying data.

59 - The performance of such techniques is highly dependent on the choice of the information theoretic measure, and such measures often detect anomalies only when there are a large number of them in the data.
- When applied to sequences or spatial data sets, these techniques rely on a choice of substructure size, which is often nontrivial to obtain.
- It is difficult to associate an anomaly score with a test instance using these methods.
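Returning to the C(D) - C(D - I) formulation above, a minimal sketch with Shannon entropy standing in for the complexity measure C(D) on a toy categorical data set; since the exact subset search is exponential, this removes a single instance greedily rather than searching all subsets:

```python
import numpy as np
from collections import Counter

def entropy(seq):
    # Shannon entropy as a stand-in for the complexity measure C(D).
    counts = np.array(list(Counter(seq).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

D = list("aaaaaaaaaabbbbbbbbbbz")   # 'z' is the lone anomaly

base = entropy(D)
# Greedy one-step approximation of max C(D) - C(D - I):
gains = {i: base - entropy(D[:i] + D[i + 1:]) for i in range(len(D))}
print(max(gains, key=gains.get))    # 20: the index of 'z'
```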

60 Spectral anomaly detection methods
Spectral techniques try to find an approximation of the data using a combination of attributes that capture the bulk of the variability in the data.
Assumption: data can be embedded into a lower dimensional subspace in which normal instances and anomalies appear significantly different.
The goal is to determine subspaces in which the anomalous instances can be easily identified. Such techniques can work in an unsupervised as well as a semi-supervised setting.

61 Principal component analysis (PCA) is the major algorithm used.
Complexity: typically linear in the number of instances but quadratic in the dimensionality. Singular value decompositions are frequently used and can cost O(N^2).
+/- of spectral based methods:
+ Spectral techniques automatically perform dimensionality reduction and are suitable for handling high dimensional data sets.
+ They can be used as a preprocessing step, followed by any existing anomaly detection technique in the transformed space.

62 + They can be used in an unsupervised setting.
- Spectral techniques are useful only if the normal and anomalous instances are separable in the lower dimensional embedding of the data.
- They have high computational complexity.
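A minimal PCA sketch: synthetic normal data is generated near a one-dimensional subspace of three-dimensional space, and the reconstruction error in the discarded directions serves as the anomaly score:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(0.0, 1.0, 300)
X = np.column_stack([t, 2 * t, -t]) + rng.normal(0.0, 0.05, (300, 3))
X = np.vstack([X, [[1.0, -2.0, 1.0]]])       # planted anomaly off the subspace

pca = PCA(n_components=1).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))

# Reconstruction error in the discarded directions is the anomaly score.
scores = np.sum((X - X_hat) ** 2, axis=1)
print(np.argmax(scores))                     # 300: the planted anomaly
```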

63 Handling contextual anomalies
Contexts can be defined in several ways: spatial, graphs, sequential, profile. There is very little literature in this area; it is ripe for Ph.D. dissertations.

64 Quick summary
A general theory of anomaly detection is still an open research problem, one that will reward numerous students with Ph.D.s in the future. That said, many individual areas have been developed over a long time.
