Clustering from Data Streams


Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt


Outline 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting to Change Properties of ODAC 4 References

Clustering What is cluster analysis? Grouping a set of data objects into a set of clusters such that intra-cluster similarity is high and inter-cluster similarity is low. The quality of a clustering result depends on the similarity measure used; the quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Illustrative Example: K-means [MacQueen, 1967]: Each cluster is represented by the center of the cluster.
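The center-based idea above can be sketched in a few lines. This is a minimal 1-D illustration, not from the slides: a hypothetical `kmeans` helper with a naive deterministic initialization (the first k points), alternating the assignment and update steps.

```python
def kmeans(points, k, iters=20):
    # Naive deterministic init: take the first k points as centers.
    centers = list(points[:k])
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            groups[i].append(p)
        # Update step: recompute each center as the mean of its group;
        # keep an empty group's old center unchanged.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers
```

On a stream, this batch formulation breaks down: it rescans all points every iteration, which motivates the summary-statistics approaches that follow.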

K-Means for Streaming Data

Illustrative Example: Hierarchical Clustering Bottom-Up: initial state: each object is a group; iteratively join two groups into a single one. Top-Down: initial state: a single group with all the objects; iteratively divide each group into two groups.
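The bottom-up variant can be sketched directly. This is an illustrative 1-D toy (the `agglomerate` name and mean-distance linkage are assumptions, not from the slides): start with singleton groups and repeatedly join the two groups whose means are closest, until k groups remain.

```python
def agglomerate(points, k):
    # Bottom-up: every object starts as its own group.
    groups = [[p] for p in points]
    while len(groups) > k:
        # Find the pair of groups with the closest means (centroid linkage).
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = abs(sum(groups[i]) / len(groups[i])
                        - sum(groups[j]) / len(groups[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] += groups.pop(j)   # join the two closest groups
    return groups
```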

Major Clustering Approaches Partitioning algorithms: construct various partitions and then evaluate them by some criterion (e.g., k-means, k-medoids). Hierarchical algorithms: create a hierarchical decomposition of the set of data objects using some criterion; often need to integrate with other clustering methods (e.g., BIRCH). Density-based: based on connectivity and density functions; find clusters of arbitrary shape (e.g., DBSCAN, OPTICS). Grid-based: based on a multiple-level granularity structure; view the space as a grid (e.g., STING, CLIQUE). Model-based: find the best fit of a model to the clusters; good for conceptual clustering (e.g., COBWEB, SOM).

Learning Algorithms: Desirable Properties Process each example in small constant time, using a fixed amount of main memory, in a single scan of the data, without (or with reduced) revisiting of old records; process examples at the speed they arrive; provide a decision model at any time; ideally, produce a model equivalent to the one that would be obtained by a batch data-mining algorithm; detect and react to concept drift.

Clustering Data Streams New requirements in stream clustering: generate high-quality clusters in one scan; high-quality, efficient incremental clustering; analysis should handle multi-dimensional spaces; analysis at different time granularities; tracking the evolution of clusters. Clustering is a stream data reduction technique.

Outline 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting to Change Properties of ODAC 4 References

Cluster Feature Vector BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies [Zhang, Ramakrishnan & Livny, 1996]. Cluster Feature Vector: CF = (N, LS, SS). N: number of data points; LS: the linear sum, Σ_{i=1}^{N} x_i; SS: the sum of squares, Σ_{i=1}^{N} x_i². Constant space irrespective of the number of examples!

Micro clusters The sufficient statistics of a cluster A are CF_A = (N, LS, SS): N, the number of data objects; LS, the linear sum of the data objects; SS, the sum of the squares of the data objects. Properties: Centroid = LS/N; Radius = sqrt(SS/N − (LS/N)²); Diameter = sqrt((2·N·SS − 2·LS²) / (N·(N−1))).

Micro clusters Given the sufficient statistics of a cluster A, CF_A = (N_A, LS_A, SS_A), updates are: Incremental, when a point x is added to the cluster: LS_A ← LS_A + x; SS_A ← SS_A + x²; N_A ← N_A + 1. Additive, when merging clusters A and B: LS_C ← LS_A + LS_B; SS_C ← SS_A + SS_B; N_C ← N_A + N_B. Subtractive: CF(C1 − C2) = CF(C1) − CF(C2).
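The incremental and additive updates above are trivial to implement. A minimal 1-D sketch (class and method names are illustrative, not from the slides), with centroid and radius derived from the sufficient statistics:

```python
from math import sqrt

class CF:
    """Cluster feature vector (N, LS, SS) for a 1-D stream, as in BIRCH."""
    def __init__(self):
        self.n, self.ls, self.ss = 0, 0.0, 0.0

    def add(self, x):
        # Incremental update: absorb one point.
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        # Additive update: absorb a whole cluster.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # sqrt(SS/N - (LS/N)^2); clamp to guard against rounding.
        return sqrt(max(self.ss / self.n - (self.ls / self.n) ** 2, 0.0))
```

Note that the structure never stores the points themselves: constant space, regardless of how many examples have been absorbed.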

CluStream CluStream: A Framework for Clustering Evolving Data Streams (VLDB 2003). Divides the clustering process into online and offline components. Online: periodically stores summary statistics about the stream data; micro-clustering gives better quality than k-means; incremental, online processing and maintenance. Offline: answers various user queries based on the stored summary statistics. Tilted time frame: registers dynamic changes. With limited overhead, it achieves high efficiency, scalability, quality of results, and power of evolution/change detection.

CluStream: Online Phase Input: maximum micro-cluster diameter D_max. For each x in the stream: find the nearest micro-cluster M_i; IF the diameter of (M_i ∪ {x}) < D_max THEN assign x to that micro-cluster (M_i ← M_i ∪ {x}) ELSE start a new micro-cluster based on x.
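The online phase above can be sketched over 1-D points, with each micro-cluster kept as a (N, LS, SS) triple. This is a hedged illustration (the `online_phase` and `diameter` names are assumptions), using the diameter formula from the micro-cluster slide:

```python
from math import sqrt

def diameter(n, ls, ss):
    # sqrt((2*N*SS - 2*LS^2) / (N*(N-1))) for N > 1, else 0.
    return sqrt(max(2 * n * ss - 2 * ls * ls, 0.0) / (n * (n - 1))) if n > 1 else 0.0

def online_phase(stream, d_max):
    clusters = []                       # each micro-cluster is [N, LS, SS]
    for x in stream:
        # Find the micro-cluster whose centroid LS/N is nearest to x.
        best = min(clusters, key=lambda c: abs(c[1] / c[0] - x), default=None)
        if best is not None:
            n, ls, ss = best[0] + 1, best[1] + x, best[2] + x * x
            if diameter(n, ls, ss) < d_max:
                best[:] = [n, ls, ss]   # absorb x into the nearest cluster
                continue
        clusters.append([1, x, x * x])  # otherwise start a new micro-cluster
    return clusters
```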

Pyramidal Time Frame The micro-clusters are stored at snapshots. When should we take a snapshot? The snapshots follow a pyramidal pattern: recent snapshots are stored at fine granularity, older snapshots at increasingly coarse granularity.

Analysis Find the cluster structure in the current window; find the cluster structure over time ranges with granularity confined by the specification of window size and boundary; put different weights on different windows to mine various kinds of weighted cluster structures; mine the evolution of cluster structures based on the changes of their occurrences in a sequence of windows.

Any Time Stream Clustering Properties of anytime algorithms: deliver a model at any time; improve the model if more time is available; model adaptation whenever an instance arrives; model refinement whenever time permits. ClusTree [Kranen et al., 2011]: an online component to learn micro-clusters; any variety of offline components can be utilized on top of them; micro-clusters are subject to exponential aging.

MOA

Outline 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting to Change Properties of ODAC 4 References

Clustering Time Series Data Streams Goal: continuously maintain a clustering structure from evolving time series data streams. Ability to incorporate new information and process it at the rate it arrives; ability to detect and react to changes in the clusters' structure. Clustering of variables (sensors), not examples! The standard technique of transposing the working matrix does not work: transpose is a blocking operator!

Online Divisive-Agglomerative Clustering Online Divisive-Agglomerative Clustering (ODAC) [Rodrigues & Gama, 2008]. Goal: continuously maintain a hierarchical clustering structure from evolving time series data streams. Performs hierarchical clustering; continuously monitors the evolution of cluster diameters. Two operators: Splitting: expand the structure (more data, more detailed clusters); Merge: contract the structure, reacting to changes. The splitting and agglomerative criteria are supported by a confidence level given by the Hoeffding bound.

Main Algorithm Forever: read the next example; for all the clusters, update the sufficient statistics; from time to time, check whether to merge clusters and whether to expand a cluster.

Feeding ODAC Each example is processed once. Only sufficient statistics at leaves are updated. Sufficient Statistics: a triangular matrix of the correlations between variables in a leaf. Released when a leaf expands to a node.

Similarity Distance Distance between time series: rnomc(a, b) = sqrt((1 − corr(a, b)) / 2), where corr(a, b) is the Pearson correlation coefficient: corr(a, b) = (P − A·B/n) / (sqrt(A2 − A²/n) · sqrt(B2 − B²/n)). The sufficient statistics needed to compute the correlation are easily updated at each time step: A = Σ a_i, B = Σ b_i, A2 = Σ a_i², B2 = Σ b_i², P = Σ a_i·b_i.
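The formulas above map directly to code, with function arguments named after the slide's running sums (A, B, A2, B2, P, n). A minimal sketch, assuming the sums have already been accumulated incrementally:

```python
from math import sqrt

def corr(A, B, A2, B2, P, n):
    # Pearson correlation from the sufficient statistics only.
    num = P - A * B / n
    den = sqrt(A2 - A * A / n) * sqrt(B2 - B * B / n)
    return num / den

def rnomc(A, B, A2, B2, P, n):
    # Rooted normalized one-minus-correlation: 0 for perfectly
    # correlated series, 1 for perfectly anti-correlated ones.
    return sqrt((1 - corr(A, B, A2, B2, P, n)) / 2)
```

Because only five sums per pair are needed, ODAC can keep a triangular matrix of these statistics at each leaf and never revisit old observations.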

The Expand Operator: Expanding a Leaf Step 1: find pivots x_j, x_k such that d(x_j, x_k) ≥ d(a, b) for all pairs (a, b). Step 2: if the splitting criterion applies, generate two new clusters. Step 3: each new cluster attracts its nearest variables.

Splitting Criteria When should we expand a leaf? Let d1 = d(a, b) be the farthest distance and d2 the second farthest distance. Question: is d1 a stable option? What if we observe more examples? Split if d1 − d2 > ε, with the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the random variable, δ is a user-defined confidence level, and n is the number of observed data points.

Hoeffding bound Suppose we have made n independent observations of a random variable r whose range is R. The Hoeffding bound states that, with probability 1 − δ, the true mean of r is in the range r̄ ± ε, where ε = sqrt(R² ln(1/δ) / (2n)). This holds independently of the probability distribution generating the examples.
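The bound is a one-liner in code (the `hoeffding_epsilon` name is illustrative). Note how ε shrinks as more observations arrive, which is what makes the splitting decision increasingly confident:

```python
from math import log, sqrt

def hoeffding_epsilon(R, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n)):
    # R = range of the variable, delta = confidence, n = observations.
    return sqrt(R * R * log(1 / delta) / (2 * n))
```

For example, with R = 1, δ = 0.05, and n = 1000 observations, ε ≈ 0.039; quadrupling n halves ε.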

The Expand Operator: Expanding a Leaf Step 1: find pivots x_j, x_k such that d(x_j, x_k) ≥ d(a, b) for all pairs (a, b). Step 2: if the Hoeffding bound applies, generate two new clusters. Step 3: each new cluster attracts its nearest variables.

Multi-Time-Windows A multi-window system: each node (and leaf) receives examples from a different time window.

The Merge Operator: Change Detection Time series concept drift: changes in the distribution generating the observations. Clustering concept drift: changes in the way time series correlate with each other; changes in the cluster structure.

The Merge Operator: Change Detection The splitting criterion guarantees that cluster diameters monotonically decrease. Assume cluster c_j has descendants c_k and c_s. If diameter(c_k) − diameter(c_j) > ε OR diameter(c_s) − diameter(c_j) > ε, there is a change in the correlation structure! Merge clusters c_k and c_s into c_j.

Properties of ODAC For stationary data, the cluster diameters monotonically decrease. Constant update time and memory consumption with respect to the number of examples! Every time a split is reported, the time to process the next example decreases, and the space used by the new leaves is less than that used by the parent, since (a + b)² > a² + b² for positive a, b.

The Electrical Load Demand Problem

The Electrical Load Demand Problem

Evolution of Processing Speed

Evolution of Memory Usage

Outline 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting to Change Properties of ODAC 4 References

Master References J. Gama, Knowledge Discovery from Data Streams, CRC Press, 2010. S. Muthukrishnan, Data Streams: Algorithms and Applications, Foundations & Trends in Theoretical Computer Science, 2005. C. Aggarwal (Ed.), Data Streams: Models and Algorithms, Springer, 2007. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, Models and Issues in Data Stream Systems, Proceedings of PODS, 2002. M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, Mining Data Streams: A Review, ACM SIGMOD Record, Vol. 34, No. 1, 2005. J. Silva, E. Faria, R. Barros, E. Hruschka, A. Carvalho, and J. Gama, Data Stream Clustering: A Survey, ACM Computing Surveys 46(1): 13, 2013.

Bibliography on Clustering P. Rodrigues, J. Gama, and J. P. Pedroso, Hierarchical Clustering of Time Series Data Streams, TKDE, 2008. T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, SIGMOD 1996. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, A Framework for Clustering Evolving Data Streams, VLDB 2003. G. Cormode, S. Muthukrishnan, and W. Zhuang, Conquering the Divide: Continuous Clustering of Distributed Data Streams, ICDE 2007. P. Kranen, I. Assent, C. Baldauf, and T. Seidl, The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining, Knowledge and Information Systems 29(2), 2011. P. Rodrigues and J. Gama, Clustering Distributed Sensor Data Streams, ECML/PKDD 2008. P. Rodrigues and J. Gama, L2GClust: Local-to-Global Clustering of Stream Sources, SAC 2011.