Multi-task Multi-modal Models for Collective Anomaly Detection

Multi-task Multi-modal Models for Collective Anomaly Detection
Tsuyoshi Ide ("Ide-san"), PhD, Senior Technical Staff Member; Dzung T. Phan; J. Kalagnanam
IBM Thomas J. Watson Research Center
IEEE International Conference on Data Mining (ICDM 2017)
These slides are available at ide-research.net.

Outline
- Problem setting
- Modeling strategy
- Model inference approach
- Experimental results

We wish to build a collective monitoring solution
[Figure: a fleet of systems, from System 1 (in New Orleans) through System s to System S (in New York).]
- You have many similar but not identical industrial assets.
- You want to build an anomaly detection model for each of the assets.
- Straightforward solutions have serious limitations:
  o 1. Treat the systems separately and create each model individually: suffers from a lack of fault examples.
  o 2. Build one universal model that disregards individuality: the model fit is not good.

Practical requirements: need to capture both commonality and individuality
[Figure: the same fleet of systems, from System 1 (in New Orleans) to System S (in New York).]
- Capture both individuality and commonality.
- Automatically capture multiple operational states.
  o The real world is not single-peaked (single-modal).
- Be robust to noise.
- Be highly interpretable for diagnosis purposes.

Formalizing the problem as multi-task density estimation for anomaly detection
[Figure: for each system s = 1, ..., S, all data are mapped via multi-task learning (MTL) to a per-system probability density, which in turn yields an overall and a variable-wise anomaly score.]
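For concreteness, in the sparse-GGM line of work this builds on, both anomaly scores are negative log likelihoods; a sketch in our notation (not copied from the paper):

\[
a^{(s)}(\boldsymbol{x}) = -\ln p(\boldsymbol{x} \mid s),
\qquad
a_i^{(s)}(\boldsymbol{x}) = -\ln p(x_i \mid \boldsymbol{x}_{-i}, s),
\]

where \(\boldsymbol{x}_{-i}\) denotes all variables except the i-th, so the variable-wise score localizes an anomaly to individual sensors.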

Outline
- Problem setting
- Modeling strategy
- Model inference approach
- Experimental results

Use a Gaussian graphical model (GGM)-based anomaly detection approach as the basic building block
[Figure: multivariate training data -> sample covariance -> sparse graphical model -> anomaly score (overall score and variable-wise score). [Ide+ SDM09] [Ide+ ICDM16]]
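A minimal sketch of this building block in Python, using scikit-learn's GraphicalLasso as the sparse-precision estimator (the function names and the alpha value are our own choices, not from the paper):

import numpy as np
from sklearn.covariance import GraphicalLasso

def fit_sparse_ggm(X, alpha=0.1):
    # L1-regularized precision matrix estimated from the sample covariance.
    gl = GraphicalLasso(alpha=alpha).fit(X)
    return gl.location_, gl.precision_

def anomaly_scores(x, mu, Lam):
    d = x - mu
    # Overall score: negative log density of x (2*pi constants dropped).
    _, logdet = np.linalg.slogdet(Lam)
    overall = 0.5 * d @ Lam @ d - 0.5 * logdet
    # Variable-wise score: negative log conditional density of x_i given
    # the other variables; under a GGM it has this closed form, since
    # x_i | x_{-i} is Gaussian with precision Lam_ii.
    lam_ii = np.diag(Lam)
    variable_wise = 0.5 * (Lam @ d) ** 2 / lam_ii - 0.5 * np.log(lam_ii)
    return overall, variable_wise

In monitoring, a high overall score flags a sample, and the largest variable-wise components point to the sensors responsible, which is what makes the model interpretable for diagnosis.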

Basic modeling strategy: combine a common pattern dictionary with individual weights
[Figure: a common dictionary of sparse graphs (sparse GGM 1, ..., sparse GGM K) is combined with individual sparse weights to form a monitoring model for each of System 1 (in New Orleans) through System S (in New York). GGM = Gaussian graphical model.]

Basic modeling strategy: the resulting model is a sparse mixture of sparse GGMs
[Figure: the monitoring model for System s is a Gaussian mixture over the dictionary sparse GGM 1, ..., sparse GGM K. GGM = Gaussian graphical model.]
- Sparse mixture weights (= automatic determination of the number of patterns)
- Sparse Gaussian graphical models
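In symbols (our notation), the monitoring model for task s is a mixture over a shared dictionary of K sparse GGMs with a task-specific sparse weight vector:

\[
p(\boldsymbol{x} \mid s) = \sum_{k=1}^{K} \pi_k^{(s)}\, \mathcal{N}\!\big(\boldsymbol{x} \mid \boldsymbol{m}_k, \Lambda_k^{-1}\big),
\]

where the precision matrices \(\Lambda_k\) are sparse and shared across tasks, while many of the weights \(\pi_k^{(s)}\) are driven exactly to zero, so each system effectively selects its own small subset of patterns.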

Outline
- Problem setting
- Modeling strategy
- Model inference approach
- Experimental results

Employing a Bayesian model for multi-modal MTL
- Observation model (for the s-th task)
  o Gaussian mixture with task-dependent weights
- Sparsity-enforcing priors (non-conjugate)
  o Laplace prior for the precision matrices
  o Bernoulli prior for the mixture weights
- Conjugate priors on the remaining parameters
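A sketch of how the two sparsity-enforcing priors translate into penalties (the hyperparameters \(\rho\) and \(\nu\) are our placeholders; see the paper for the exact parameterization):

\[
p(\Lambda_k) \propto \exp\!\Big(-\tfrac{\rho}{2}\,\|\Lambda_k\|_1\Big),
\qquad
p(\boldsymbol{\pi}^{(s)}) \propto \exp\!\big(-\nu\,\|\boldsymbol{\pi}^{(s)}\|_0\big),
\]

i.e., the Laplace prior induces an \(\ell_1\) (graphical lasso) penalty that makes the precision matrices sparse, and the Bernoulli-type prior induces an \(\ell_0\) penalty that makes the mixture weights sparse.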

Maximizing the log likelihood using variational Bayes combined with point estimation
- Log likelihood = likelihood under the observation model + prior distributions
- Use VB for the latent cluster assignments
- Use point estimates for the mixture weights and precision matrices
  o Results in two convex optimization problems

Maximizing the log likelihood: the update loop
[Figure: iterate the following updates until convergence.]
- Update the sample weights (responsibilities)
- Update the cluster weights: use the new semi-closed-form solution, which depends on the ratio and total number of samples assigned to the k-th cluster
- Update the precision matrices: solved by graphical lasso [Friedman 08]
- Update the other parameters
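As a concrete reference, here is a runnable single-task skeleton of this loop in Python. It is our simplification, not the paper's algorithm: the cluster-weight step uses the plain normalized-count update instead of the semi-closed-form L0 solution, and the task index s is dropped.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.covariance import graphical_lasso

def fit_sparse_gmm(X, K, alpha=0.1, n_iter=20, seed=0):
    # EM-style loop: the E-step computes the "sample weights"
    # (responsibilities); the M-step re-fits each cluster's sparse
    # precision matrix with graphical lasso [Friedman 08].
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    Lam = np.array([np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibility r[i, k] of cluster k for sample i.
        logr = np.column_stack([
            np.log(pi[k] + 1e-300)
            + multivariate_normal(mu[k], np.linalg.inv(Lam[k])).logpdf(X)
            for k in range(K)
        ])
        r = np.exp(logr - logr.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step.
        Nk = r.sum(axis=0)              # mass assigned to each cluster
        pi = Nk / n                     # plain update; the paper's L0 step goes here
        for k in range(K):
            if Nk[k] < d:               # skip (near-)empty clusters
                continue
            mu[k] = r[:, k] @ X / Nk[k]
            D = X - mu[k]
            S = (r[:, k, None] * D).T @ D / Nk[k]
            _, Lam[k] = graphical_lasso(S, alpha=alpha)
    return pi, mu, Lam

In the full model this loop runs over all tasks, with the precision matrices shared across tasks and a separate sparse weight vector per task.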

Solving the L0-regularized optimization problem for the mixture weights
- What is the problem with the conventional VB approach?
  o It simply differentiates the objective w.r.t. the mixture weights
  o It claims to get a sparse solution [Corduneanu+ 01]
  o But mathematically the weights cannot be exactly zero, due to the logarithm
- We re-formalized the problem as a convex mixed-integer program
  o A semi-closed-form solution can be derived (see paper)
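In our notation, the weight subproblem for a task then looks like

\[
\max_{\boldsymbol{\pi} \in \Delta^{K-1}} \;\; \sum_{k=1}^{K} N_k \ln \pi_k \;-\; \nu\, \|\boldsymbol{\pi}\|_0,
\]

where \(\Delta^{K-1}\) is the probability simplex and \(N_k\) is the responsibility mass assigned to cluster k. Since \(\ln \pi_k \to -\infty\) as \(\pi_k \to 0\), setting a gradient to zero can never produce an exactly zero weight; making the support of \(\boldsymbol{\pi}\) explicit with binary variables is what yields the convex mixed-integer program above.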

Comparison with possible alternatives

| Method | Interpretability | Noise reduction | Fleet-readiness | Multi-modality |
| Our work [Ide et al., ICDM 2017] | Yes | Yes | Yes | Yes |
| (Single) sparse GGM [Ide et al., SDM 2009; Ide et al., ICDM 2016] | Yes | Yes | No | No |
| Gaussian mixtures [Yamanishi et al., 2000; Zhang and Fung, 2013; Gao et al., 2016] | Limited | Limited | No | Yes |
| Multi-task sparse GGM [Varoquaux et al., 2010; Honorio and Samaras, 2010; Chiquet et al., 2011; Danaher et al., 2014; Gao et al., 2016; Peterson et al., 2015] | Yes | Yes | Yes | No |
| Multi-task learning anomaly detection [Bahadori et al., 2011; He et al., 2014; Xiao et al., 2015] | No | (depends) | Yes | No |

Outline
- Problem setting
- Modeling strategy
- Model inference approach
- Experimental results

Experiment (1): Learning sparse mixture weights
- The conventional ARD approach sometimes gets stuck in local minima
  o ARD = automatic relevance determination
  o Often less sparse than the proposed convex L0 approach
- The proposed convex L0 approach gives a better likelihood
[Figure: typical result of log likelihood vs. VB iteration count; the conventional ARD approach gets stuck in a local minimum.]

Experiment (2): Learning GGMs and detecting anomalies
- Anuran Calls (frog voice) data from the UCI Archive
  o Multi-modal (multi-peaked)
  o Voice signal + attributes (species, etc.)
- Created a 10-variate, 3-task dataset
  o Use the species Rhinella granulosa as the anomaly
- Results
  o Two non-empty GGMs are automatically detected, starting from K = 9
  o Clearly outperformed single-modal MTL alternatives (group graphical lasso, fused graphical lasso) in anomaly detection
[Figure: example of a variable-wise distribution, and the automatically learned GGMs.]

Conclusion
- Developed a multi-task density estimation framework that can handle multi-modality
  o Featuring double sparsity: in the mixture weights and in the variable dependency
- Demonstrated its utility in the context of condition-based asset management

Thank you!

An integrated monitoring tool allows sharing rare anomaly data across different assets
[Figure: pie chart; anomalous: 0.2%, normal: 99.8%.]
- In condition-based monitoring, big data may not be really big
  o Anomalous samples account for less than 0.2% in a metal smelting process
- Coverage of anomalies, and thus accuracy, can be limited due to lack of data

Existing methods cannot handle multi-modality
[Figure: probability densities over a sensor variable (e.g. temperature) for System (task) 1 (in New Orleans) and System (task) 2 (in New York), comparing the proposed multi-task multi-modal (MTL-MM) model with standard Gaussian mixture (GMM) and multi-task learning (MTL) models.]
- GMM: can handle multi-modality, but the two systems must have the same model
- MTL: can treat different systems differently, but cannot handle multi-modality