DATA STREAMS: MODELS AND ALGORITHMS

Edited by CHARU C. AGGARWAL
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598

Kluwer Academic Publishers
Boston/Dordrecht/London

Contents

List of Figures
List of Tables
Preface

1 An Introduction to Data Streams
Charu C. Aggarwal
  1. Introduction
  2. Stream Mining Algorithms
  3. Conclusions and Summary
  References

2 On Clustering Massive Data Streams: A Summarization Paradigm
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
  1. Introduction
  2. The Micro-clustering Based Stream Mining Framework
  3. Clustering Evolving Data Streams: A Micro-clustering Approach
    3.1 Micro-clustering Challenges
    3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
    3.3 High Dimensional Projected Stream Clustering
  4. Classification of Data Streams: A Micro-clustering Approach
    4.1 On-Demand Stream Classification
  5. Other Applications of Micro-clustering and Research Directions
  6. Performance Study and Experimental Results
  7. Discussion
  References

3 A Survey of Classification Methods in Data Streams
Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy
  1. Introduction
  2. Research Issues
  3. Solution Approaches
  4. Classification Techniques
    4.1 Ensemble Based Classification
    4.2 Very Fast Decision Trees (VFDT)
    4.3 On Demand Classification
    4.4 Online Information Network (OLIN)
    4.5 LWClass Algorithm
    4.6 ANNCAD Algorithm
    4.7 SCALLOP Algorithm
  5. Summary
  References

4 Frequent Pattern Mining in Data Streams
Ruoming Jin and Gagan Agrawal
  1. Introduction
  2. Overview
  3. New Algorithm
  4. Work on Other Related Problems
  5. Conclusions and Future Directions
  References

5 A Survey of Change Diagnosis Algorithms in Evolving Data Streams
Charu C. Aggarwal
  1. Introduction
  2. The Velocity Density Method
    2.1 Spatial Velocity Profiles
    2.2 Evolution Computations in High Dimensional Case
    2.3 On the use of clustering for characterizing stream evolution
  3. On the Effect of Evolution in Data Mining Algorithms
  4. Conclusions
  References

6 Multi-Dimensional Analysis of Data Streams Using Stream Cubes
Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and Jianyong Wang
  1. Introduction
  2. Problem Definition
  3. Architecture for On-line Analysis of Data Streams
    3.1 Tilted time frame
    3.2 Critical layers
    3.3 Partial materialization of stream cube
  4. Stream Data Cube Computation
    4.1 Algorithms for cube computation
  5. Performance Study
  6. Related Work
  7. Possible Extensions
  8. Conclusions
  References

7 Load Shedding in Data Stream Systems
Brian Babcock, Mayur Datar and Rajeev Motwani
  1. Load Shedding for Aggregation Queries
    1.1 Problem Formulation
    1.2 Load Shedding Algorithm
    1.3 Extensions
  2. Load Shedding in Aurora
  3. Load Shedding for Sliding Window Joins
  4. Load Shedding for Classification Queries
  5. Summary
  References

8 The Sliding-Window Computation Model and Results
Mayur Datar and Rajeev Motwani
  0.1 Motivation and Road Map
  1. A Solution to the BasicCounting Problem
    1.1 The Approximation Scheme
  2. Space Lower Bound for BasicCounting Problem
  3. Beyond 0's and 1's
  4. References and Related Work
  5. Conclusion
  References

9 A Survey of Synopsis Construction in Data Streams
Charu C. Aggarwal, Philip S. Yu
  1. Introduction
  2. Sampling Methods
    2.1 Random Sampling with a Reservoir
    2.2 Concise Sampling
  3. Wavelets
    3.1 Recent Research on Wavelet Decomposition in Data Streams
  4. Sketches
    4.1 Fixed Window Sketches for Massive Time Series
    4.2 Variable Window Sketches of Massive Time Series
    4.3 Sketches and their applications in Data Streams
    4.4 Sketches with p-stable distributions
    4.5 The Count-Min Sketch
    4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements
    4.7 Advantages and Limitations of Sketch Based Methods
  5. Histograms
    5.1 One Pass Construction of Equi-depth Histograms
    5.2 Constructing V-Optimal Histograms
    5.3 Wavelet Based Histograms for Query Answering
    5.4 Sketch Based Methods for Multi-dimensional Histograms
  6. Discussion and Challenges
  References

10 A Survey of Join Processing in Data Streams
Junyi Xie and Jun Yang
  1. Introduction
  2. Model and Semantics
  3. State Management for Stream Joins
    3.1 Exploiting Constraints
    3.2 Exploiting Statistical Properties
  4. Fundamental Algorithms for Stream Join Processing
  5. Optimizing Stream Joins
  6. Conclusion
  Acknowledgments
  References

11 Indexing and Querying Data Streams
Ahmet Bulut, Ambuj K. Singh
  1. Introduction
  2. Indexing Streams
    2.1 Preliminaries and definitions
    2.2 Feature extraction
    2.3 Index maintenance
    2.4 Discrete Wavelet Transform
  3. Querying Streams
    3.1 Monitoring an aggregate query
    3.2 Monitoring a pattern query
    3.3 Monitoring a correlation query
  4. Related Work
  5. Future Directions
    5.1 Distributed monitoring systems
    5.2 Probabilistic modeling of sensor networks
    5.3 Content distribution networks
  6. Chapter Summary
  References

12 Dimensionality Reduction and Forecasting on Streams
Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos
  1. Related work
  2. Principal component analysis (PCA)
  3. Auto-regressive models and recursive least squares
  4. MUSCLES
  5. Tracking correlations and hidden variables: SPIRIT
  6. Putting SPIRIT to work
  7. Experimental case studies
  8. Performance and accuracy
  9. Conclusion
  Acknowledgments
  References

13 A Survey of Distributed Mining of Data Streams
Srinivasan Parthasarathy, Amol Ghoting and Matthew Eric Otey
  1. Introduction
  2. Outlier and Anomaly Detection
  3. Clustering
  4. Frequent itemset mining
  5. Classification
  6. Summarization
  7. Mining Distributed Data Streams in Resource Constrained Environments
  8. Systems Support
  References

14 Algorithms for Distributed Data Stream Mining
Kanishka Bhaduri, Kamalika Das, Krishnamoorthy Sivakumar, Hillol Kargupta, Ran Wolff and Rong Chen
  1. Introduction
  2. Motivation: Why Distributed Data Stream Mining?
  3. Existing Distributed Data Stream Mining Algorithms
  4. A local algorithm for distributed data stream mining
    4.1 Local Algorithms: definition
    4.2 Algorithm details
    4.3 Experimental results
    4.4 Modifications and extensions
  5. Bayesian Network Learning from Distributed Data Streams
    5.1 Distributed Bayesian Network Learning Algorithm
    5.2 Selection of samples for transmission to global site
    5.3 Online Distributed Bayesian Network Learning
    5.4 Experimental Results
  6. Conclusion
  References

15 A Survey of Stream Processing Problems and Techniques in Sensor Networks
Sharmila Subramaniam, Dimitrios Gunopulos
  1. Challenges
  2. The Data Collection Model
  3. Data Communication
  4. Query Processing
    4.1 Aggregate Queries
    4.2 Join Queries
    4.3 Top-k Monitoring
    4.4 Continuous Queries
  5. Compression and Modeling
    5.1 Data Distribution Modeling
    5.2 Outlier Detection
  6. Application: Tracking of Objects using Sensor Networks
  7. Summary
  References

Index