Preprocessing. Knowledge Discovery and Data Mining 1. Roman Kern, ISDS, TU Graz.


Preprocessing. Knowledge Discovery and Data Mining 1. Roman Kern, ISDS, TU Graz. 2017-10-12.

Big picture: KDDM
- Mathematical tools: Probability Theory, Linear Algebra, Information Theory, Statistical Inference
- Infrastructure: Hardware & Programming Model
- Knowledge Discovery Process

Outline
1. Introduction
2. Web Crawling
3. Data Cleaning
4. Outlier Detection

Introduction: Data acquisition & pre-processing

Introduction
The initial phase of the Knowledge Discovery process:
- Acquire the data to be analysed, e.g. by crawling the data from the Web
- Prepare the data, e.g. by cleaning it and removing outliers

Web Crawling: Acquire data from the Web

Motivation for Web crawling
- Example question: How does a search engine know that all these pages contain the query terms?
- Answer: Because all of those pages have been crawled!

Motivation for Web crawling: Use cases
- General web search engines (e.g. Google, Yandex, ...)
- Vertical search engines (e.g. Yelp)
- Business intelligence
- Online reputation management
- Data set generation

Names for Web crawlers
- A web crawler is a specialised Web client that uses the HTTP protocol
- There are many different names for Web crawlers: crawler, spider, robot (or bot), Web agent, Web scutter, wanderer, worm, ant, automatic indexer, scraper, ...
- Well-known instances: googlebot, scooter, slurp, msnbot, ...
- Many libraries, e.g. Heritrix

Simple Web crawling schema: Basic idea
- The crawler starts at an initial web page
- The web page is downloaded and its content gets analysed (typically the web page will be HTML)
- The links within the web page are extracted
- All the links are candidates for the next web pages to be crawled
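The loop above can be sketched in a few lines of Python (a minimal illustration, not the slide's own code; `fetch` and `extract_links` are hypothetical callables supplied by the caller):

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Minimal crawl loop: download a page, extract its links,
    and enqueue unseen URLs as candidates for the next fetch."""
    frontier = deque(seed_urls)   # URLs waiting to be crawled
    seen = set(seed_urls)         # avoid crawling the same URL twice
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        content = fetch(url)                 # download the web page
        pages[url] = content                 # content gets analysed here
        for link in extract_links(content):  # links = candidate next pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

With a FIFO frontier as shown, the crawl proceeds breadth-first; swapping in a stack changes the strategy (see the slide on crawling strategies).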

Figure: Simple Web crawling schema

Types of crawlers
- Batch crawler: snapshot of the current state, typically runs until a certain threshold is reached
- Incremental crawler: revisits URLs to keep up to date
- Specialised crawlers, e.g. focused crawlers, topical crawlers

Challenges of web crawling
- The large volume of the Web
- The volatility of the Web, e.g. many Web pages change frequently
- Dynamic Web pages, which are rendered in the client, including dynamically generated URLs

Challenges of web crawling (cont.)
- Avoid crawling the same resource multiple times, e.g. normalise/canonicalise URLs
- Cope with errors while downloading, e.g. slow, unreliable connections
- Detect redirect loops
- Memory consumption, e.g. a large frontier
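URL canonicalisation can be sketched with Python's standard library (only a few of the many rules a real crawler applies; the function name is ours):

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalise(url):
    """Normalise a URL so the same resource is not crawled twice:
    lowercase scheme and host, drop the default port, the fragment,
    and treat an empty path as '/'."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]          # :80 is implied for http
    if not path:
        path = "/"                    # '' and '/' are the same resource
    return urlunsplit((scheme, netloc, path, query, ""))  # drop #fragment
```

Two URLs that canonicalise to the same string are then treated as one frontier entry.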

Challenges of web crawling (cont.)
- Deal with many content types, e.g. HTML, PDF, Flash
- Gracefully parse invalid content, e.g. missing closing tags in HTML
- Identify the structure of Web pages, e.g. main text, navigation, ...

Web crawling: Extract structured information
- Usually the information is embedded in HTML, tailored towards being displayed
- Crawlers would prefer to have the data already in a structured form
- Semantic Web: highly structured, little uptake
- Microformats: less structured, but more uptake

Web crawling & the Semantic Web
- The Semantic Web should aid the process of Web crawling, as it is targeted at making the Web machine readable
- Web pages expose their content typically as RDF (Resource Description Framework) instead of the human-readable HTML, e.g. depending on the User-Agent
- There are specialised crawlers for the Semantic Web

Microformats
- Microformats are a lightweight alternative to the Semantic Web
- Embedded as HTML markup
- Supported by the major search engines, see http://microformats.org
- e.g. all 16+ billion OpenStreetMap nodes have a geo microformat
Example (taken from openstreetmap.org):
<div class="geo">
  <a href="/?lat=47.0591997&lon=15.4632963&zoom=18">
    <span class="latitude">47.0591997</span>,
    <span class="longitude">15.4632963</span>
  </a>
</div>

Crawling strategies. Two main approaches:
- Breadth-first search. Data structure: queue (FIFO). Keeps the shortest path to the start
- Depth-first search. Data structure: stack (LIFO). Quickly moves away from the initial start node
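The two strategies differ only in how the frontier is consumed, which a small sketch makes concrete (toy link graph and function name are ours):

```python
from collections import deque

web = {1: [2, 3], 2: [4], 3: [5], 4: [], 5: []}  # toy link graph

def crawl_order(seeds, breadth_first=True):
    """Visit order induced by the frontier data structure:
    queue (FIFO) -> breadth-first, stack (LIFO) -> depth-first."""
    frontier, seen, order = deque(seeds), set(seeds), []
    while frontier:
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in web[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Breadth-first visits pages in order of link distance from the seed; depth-first follows one chain of links before backtracking.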

Concurrent crawlers
- Run crawlers on multiple machines, even geographically dispersed
- Shared data structures need to be synchronised

Deep crawling
- Deep Web, also called hidden Web (in contrast to the surface Web)
- Consider a Web site that contains a form for the user to fill out, e.g. a search input box
- The task of the deep crawler is to fill out this box automatically and crawl the result

Topical crawler
- Application: on-the-fly crawling of the Web
- Starting point: a small set of seed pages
- The crawler tries to find similar pages; the seed pages are used as reference

Focused crawler
- Application: collect pages with specific properties, e.g. thematic, type
- For example: find all blogs that talk about football, where blog is the type and football is the topic
- Predict how well the pages in the frontier match the criteria
- Typically uses classification with a manually assembled training data set
- The distinction between topical crawler and focused crawler is not conclusive in the literature

Focused crawler: Cues to predict relevant pages
1. Lexical, e.g. the textual content of a page
2. Link topology, e.g. the structure of the hyperlinks
Cluster hypothesis: pages lexically (or topologically) close to a relevant page are also relevant with high probability.
Need to address two issues:
1. The link-content conjecture
2. The link-cluster conjecture

Focused crawler: Link-content conjecture
- Are two pages that link to each other more likely to be lexically similar?
- Observed: decay of the cosine similarity as a function of the mean directed link distance

Focused crawler: Link-cluster conjecture
- Are two pages that link to each other more likely to be semantically related?
- Observed: decay of the mean likelihood ratio as a function of the mean directed link distance, starting from Yahoo! directory topics

Evaluation. Example measures to compare the performance of crawlers:
- Harvest rate: percentage of good pages among the crawled pages
- Search length: number of pages to be crawled before a certain percentage of relevant pages is found
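Both measures are easy to state in code (a sketch; `labels` marks each crawled page, in crawl order, as relevant 1 or irrelevant 0, and the function names are ours):

```python
def harvest_rate(labels):
    """Fraction of crawled pages that are relevant ('good')."""
    return sum(labels) / len(labels)

def search_length(labels, k):
    """Number of pages crawled before k relevant pages have been found,
    or None if the crawl never finds k relevant pages."""
    found = 0
    for crawled, relevant in enumerate(labels, start=1):
        found += relevant
        if found == k:
            return crawled
    return None
```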

Web information extraction
Web information extraction is the problem of extracting target information items from Web pages. Two problems:
1. Extract information from natural language text
2. Extract structured data from Web pages
The first problem will be presented in an upcoming week, the second in the next minutes.

Web information extraction via structure
- Motivation: often pages on the Web are generated out of databases; data records are transformed via templates into web pages
- For example: Amazon product lists & product pages
- Task: extract the original data record out of the Web page
- This task is often called wrapper generation

Wrapper generation. Three basic approaches:
1. Manual: a simple approach, but does not scale to many sites
2. Wrapper induction: supervised approach
3. Automatic extraction: unsupervised approach
We will have a look at wrapper induction.

Wrapper induction
- Needs manually labelled training examples to learn a classification algorithm
- A simple approach: the Web page is represented by a sequence of tokens
- Idea of landmarks: locate the beginning and end of a target item

Wrapper generation: Active learning
- Manual labelling is tedious work
- Idea: reduce the amount of work by intelligently selecting the training examples (active learning)
- In the case of simple wrapper induction, use co-training: search for landmarks from the beginning and from the back at the same time, and use disagreement as an indicator for a training example to annotate

Web crawlers and ethics
- Web crawlers may cause trouble: if too many requests are sent to a single Web site it might look like a denial of service (DoS) attack, and the source IP will be blacklisted
- Respect the robots.txt file (but it is not a legal requirement)
- Some bots disguise themselves and try to replicate a user's behaviour
- Some servers disguise themselves, e.g. cloaking (various versions of the same Web page for different clients)

Web crawlers and ethics. Example: orf.at/robots.txt
# do not index the light version
User-agent: *
Disallow: /l/stories/
Disallow: /full
# these robots have been bad once:
User-agent: stress-agent
Disallow: /
User-agent: fast
Disallow: /
User-agent: Scooter
Disallow: /
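Python's standard library can evaluate such rules directly; below, an abridged version of the rules above is fed to `urllib.robotparser` (a usage sketch, not part of the slides):

```python
from urllib.robotparser import RobotFileParser

# Abridged robots.txt rules, parsed line by line.
rules = """
User-agent: *
Disallow: /l/stories/
Disallow: /full

User-agent: Scooter
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
# A polite crawler calls can_fetch(user_agent, url) before every request.
```

In practice the crawler fetches the site's robots.txt with `rp.set_url(...)` and `rp.read()` instead of inlining the rules.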

Data Cleaning: Filter out unwanted data

Data Cleaning: Motivation
Often data sets will contain:
- Unnecessary data
- Missing values
- Noise
- Incorrect data
- Inconsistent data
- Formatting issues
- Duplicate information
- Disguised data
These factors will have an impact on the results of the data mining process: garbage in, garbage out.

Unnecessary data
- Remove excess information: identify which parts contain relevant information (depends on the final purpose of the KD process)
- In the case of Web pages: get rid of navigation, ads, ...; identify the main content of a page; identify the main images

Unnecessary data: Web page cleaning
Figure: left, the original Web page; right, goose applied to extract the article text, article image, meta-data, ...
Try out Sensium: https://www.sensium.io/

Missing values
- Sources of missing values: faulty equipment, human errors, anonymous data, ...
- Consider a data set consisting of multiple rows, where in some of the rows some values are missing
- Implications of missing data [Barnard&Meng1999]: loss of efficiency due to less data; some methods cannot handle missing data; bias in the (data mining) results
- Missing values may indicate errors in the data, but not necessarily so

Categorisation of missing data [Little&Rubin1987]
- MCAR: missing completely at random; does not depend on the observed or the missing data
- MAR: missing at random; depends on the observed data, but not on the missing data
- NMAR: not missing at random; depends on the missing data
Each of the different types needs appropriate methods.

How to deal with missing values?
- Ignore the entire row
- Use a global constant
- Use the most common value
- Use an average (mean, median, ...) of all other values
- Apply machine learning techniques, e.g. regression, k-NN, clustering
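The simpler strategies can be sketched in a few lines (the function name and the `None`-for-missing convention are ours):

```python
from statistics import mean, median

def impute(column, strategy="mean"):
    """Fill missing values (None) with a statistic of the observed values:
    'mean', 'median', or 'mode' (most common value)."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:  # 'mode': use the most common observed value
        fill = max(set(observed), key=observed.count)
    return [fill if v is None else v for v in column]
```

Note that any constant-fill strategy can bias downstream statistics, which is why the slide also lists model-based imputation (regression, k-NN, clustering).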

Redundant data
- The same data, but multiple times; may lead to a bias in the (data mining) results
How to deal with redundant data?
- Correlation and covariance analysis: manually inspect scatter plots, or compute a correlation measure, e.g. Pearson's correlation coefficient
- Near-duplicate detection, e.g. search engines detect identical versions of the same Web page
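Pearson's correlation coefficient, written out from its definition (a plain-Python sketch; values near +1 or -1 flag a pair of columns as candidates for redundancy):

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length columns:
    covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```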

Redundant data. Figure: Example of Weka's scatter-plot visualisation

Normalisation
Normalisation of values, e.g. of meta-data.
Example: normalisation of dates
- Many different formats: 09/11/01, 11092001, 11/09/01, ...
- Normalise e.g. to ISO-8601: yyyy-mm-ddThh:mm:ss.nnnnnn±hh:mm
Example: normalisation of person names, e.g. {lastname}, {firstname}, {title}
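One common way to normalise such dates is to try a list of known input formats and emit ISO-8601 (a sketch; the `FORMATS` list is illustrative, and ambiguous day/month orderings still need an explicit policy):

```python
from datetime import datetime

# Input formats assumed to occur in the data, tried in order.
FORMATS = ["%d/%m/%y", "%d%m%Y", "%Y-%m-%d"]

def normalise_date(text):
    """Parse a date string with the first matching known format
    and emit the ISO-8601 date form yyyy-mm-dd."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")
```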

Noisy data
Sources of noise: faulty equipment, data entry problems, inconsistencies, data transmission problems, ...
How to deal with noisy data?
- Manual: sort out noisy data by hand
- Binning: sort the data and partition it into bins; equal-width (distance) partitioning or equal-depth (frequency) partitioning
- Regression: fit the data to regression functions
- Outlier detection
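The two binning schemes can be sketched as follows (hypothetical function names; after binning, values are typically smoothed by the bin mean, median, or boundaries):

```python
def equal_width_bins(values, k):
    """Partition the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        index = min(int((v - lo) / width), k - 1)  # maximum goes in last bin
        bins[index].append(v)
    return bins

def equal_depth_bins(values, k):
    """Partition the sorted values into k bins of roughly equal frequency."""
    s, n = sorted(values), len(values)
    return [s[i * n // k:(i + 1) * n // k] for i in range(k)]
```

Equal width keeps the intervals comparable; equal depth keeps the bin counts comparable, which is more robust to skewed data.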

Outlier Detection: Filter out unwanted data

What is an outlier?
- There is no universally accepted definition of an outlier (anomaly, novelty, change, deviation, surprise, ...)
- Outlier detection = detecting patterns that do not conform to an established normal behaviour (e.g. rare, unexpected, ...)

What is an outlier?
Definition #1: "An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs" [Grubbs1969]
Definition #2: "An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data" [Barnett1994]
Definition #3: "An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" [Hawkins1980]

Sample applications
- Intrusion detection (host-based, network-based)
- Fraud detection (credit cards, mobile phones, insurance claims)
- Medical and public health (patient records, EEG, ECG, disease outbreaks)
- Industrial damage detection (fault detection, structural defects)
- Image processing (satellite images, robot applications, sensor systems)
- Outlier detection in text data
- etc.

Types of outliers
- Point outliers
- Contextual outliers
- Collective outliers

Basic characterisation of techniques I
Supervised outlier detection
- Training data: labelled instances for both the normal and the outlier class
- The size of the classes is inherently unbalanced; obtaining representative samples of the outlier class is challenging
Semi-supervised outlier detection
- Training data: labelled instances for only one class (typically the normal class)
- Construct a model of normal behaviour, and test the likelihood that new instances are generated by this model

Basic characterisation of techniques II
Unsupervised outlier detection
- Unlabelled training data; assume that the majority of the instances in the data set are normal
- In most applications no labels are available

Classification-based techniques
Train a classifier that distinguishes between the normal and outlier classes (multi-class vs. one-class detection):
- Neural networks (e.g. replicator neural networks)
- Bayesian networks
- Support vector machines (e.g. one-class SVM)
- Rule-based (e.g. decision trees)

Nearest-neighbour-based techniques I
Assume that normal data instances occur in dense neighbourhoods, while anomalies occur far from their closest neighbours (requires a distance or similarity measure).
- Using the distance to the k-th nearest neighbour: outlier score = distance to the k-th nearest neighbour [Ramaswamy2000]
- Several variants, e.g. count the number of nearest neighbours within a certain distance
- DB(ε, π) [Knorr1997]: a point p is considered an outlier if at least π percent of all other points have a distance to p greater than ε
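The k-th-nearest-neighbour score can be sketched directly from its definition (a brute-force O(n²) sketch assuming Euclidean distance; names ours):

```python
def knn_outlier_scores(points, k):
    """Outlier score of each point = distance to its k-th nearest
    neighbour; a larger score means the point is more outlying."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    scores = []
    for p in points:
        # distances to all other points, smallest first
        dists = sorted(dist(p, q) for q in points if q is not p)
        scores.append(dists[k - 1])  # k-th nearest neighbour distance
    return scores
```

Ranking the points by score (or thresholding it) then yields the outlier candidates.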

Nearest-neighbour-based techniques II
Examples of cases where naive k-NN will not work: differences in density, or types of regularities not captured by the distance function.

Clustering-based techniques. Three categories:
1. Normal instances belong to a cluster in the data, while anomalies do not belong to any cluster (requires cluster algorithms that do not assign every instance to a cluster, e.g. DBSCAN)
2. Normal data instances lie close to their closest cluster centroid, while anomalies are far away from their closest cluster centroid (Self-Organising Maps (SOM), K-Means, Expectation Maximisation, ...)
3. Normal data instances belong to large and dense clusters, while anomalies belong to small or sparse clusters (Cluster-Based Local Outlier Factor, CBLOF: compares the distance to the centroid with the size of the cluster)

Statistical techniques
- Parametric techniques: assume knowledge of an underlying distribution (e.g. Gaussian) and estimate its parameters
- Non-parametric techniques: e.g. based on histograms
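A minimal parametric example: fit a Gaussian by estimating mean and standard deviation, then flag values with a large z-score (the 3-sigma threshold is a common convention, not from the slides):

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Parametric outlier test: estimate the parameters of an assumed
    Gaussian and flag values whose z-score exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]
```

Note that the estimated mean and standard deviation are themselves distorted by the outliers, so robust estimators (e.g. median/MAD) are often preferred in practice.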

Other techniques I
Information-theoretic techniques
- Measure the complexity C(D) of a dataset D
- Outliers: the subset I = argmax_{I ⊆ D} [C(D) − C(D \ I)]
Spectral techniques
- Find a lower-dimensional subspace in which normal instances and outliers appear significantly different (e.g. PCA)

Other techniques II
Depth-based techniques
- Organise the data into convex hull layers
- Points with depth ≤ k are reported as outliers

Other techniques III
Angle-based techniques
Angle-based outlier degree (ABOD) [Kriegel2008], where xp denotes the difference vector x − p:
ABOD(p) = VAR_{x,y} ( ⟨xp, yp⟩ / (‖xp‖² · ‖yp‖²) )
- Outliers have a smaller variance
- More stable than distances in high dimensions
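A brute-force sketch of the ABOD score from the formula above (O(n²) pairs per point; variable names ours). Points inside a cluster see their neighbours in all directions, so the angle term varies a lot; an outlier sees all other points in a narrow angular range, so its variance is small:

```python
from itertools import combinations
from statistics import pvariance

def abod(p, data):
    """Angle-based outlier degree of point p: variance over all pairs
    (x, y) of <xp, yp> / (|xp|^2 * |yp|^2), with xp = x - p."""
    def sub(a, b):
        return tuple(ai - bi for ai, bi in zip(a, b))
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    terms = []
    for x, y in combinations([q for q in data if q != p], 2):
        xp, yp = sub(x, p), sub(y, p)
        terms.append(dot(xp, yp) / (dot(xp, xp) * dot(yp, yp)))
    return pvariance(terms)
```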


Other techniques IV
Grid-based subspace outlier detection [Aggarwal2001]
- Partition the data space into an equi-depth grid (Φ = number of cells in each dimension)
- Expected fraction of points in a k-dimensional grid cell: f = (1/Φ)^k
- Sparsity coefficient of a grid cell C: S(C) = (n(C) − n·f) / sqrt(n·f·(1 − f)), where n(C) is the number of data objects in cell C and n the total number of objects
- Outliers are located in cells with S(C) < 0 (n(C) is lower than expected)
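The sparsity coefficient is a direct transcription of the formula above (function and parameter names ours):

```python
from math import sqrt

def sparsity_coefficient(n_c, n, phi, k):
    """Sparsity coefficient of a k-dimensional grid cell: standardised
    deviation of the cell count n_c from its expectation n * f under
    uniformity, where f = (1/phi)**k."""
    f = (1.0 / phi) ** k
    return (n_c - n * f) / sqrt(n * f * (1 - f))
```

For example, with n = 1000 points, Φ = 10 and k = 2, each cell is expected to hold 10 points; cells with fewer points get a negative coefficient and are flagged as outlier regions.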

Thank You!
Next up: Feature Extraction
Special thanks to Stefan Klampfl for his slides on outlier detection.