Advances in data stream mining


Mohamed Medhat Gaber

Mining data streams has been a focal point of research interest over the past decade. Hardware and software advances have contributed to the significance of this area of research by enabling data to be generated faster than ever. This rapidly generated data has been termed data streams. Credit card transactions, Google searches, phone calls in a city, and many others are typical data streams. In many important applications, it is essential to analyze this streaming data in real time. Traditional data mining techniques have fallen short in addressing the needs of data stream mining. Randomization, approximation, and adaptation have been used extensively in developing new techniques or adapting existing ones to enable them to operate in a streaming environment. This paper reviews key milestones and the state of the art in the data stream mining area. Future insights are also presented. © 2011 Wiley Periodicals, Inc.

How to cite this article: WIREs Data Mining Knowl Discov 2012, 2: doi: /widm.52

INTRODUCTION

A data stream is defined as a sequence of data instances generated at such high speed that it challenges our computational systems to store, process, and reason about it. [1,2] However, streaming data, if analyzed, is an important source of knowledge that enables us to make extremely important decisions in real time. Over the last decade, the area has attracted the attention of the data mining community, which has developed new techniques or adapted existing ones in order to realize the many important applications of data stream mining. Business, scientific, and security applications have been discussed extensively in the literature. [3,4] The last decade has witnessed active research in data stream mining. Hundreds of techniques have been proposed to address the research issues of analyzing rapidly arriving data streams in real time.
Out of this large body of literature, we can identify four categories that have contributed to shaping this area of research. Other categories of techniques can be identified; for example, a large body of one-pass techniques exists in the data stream mining literature. [1] However, the impact of the following four categories has been widely recognized. The first three categories represent approaches to building learning algorithms, whereas the last category contributes a generic controller that could be used on top of any stream mining algorithm:

1. Two-phase techniques
2. Hoeffding bound-based techniques
3. Symbolic approximation-based techniques
4. Granularity-based techniques

This paper discusses the main principle behind each of the above categories and how that principle has been applied in different techniques. This discussion is followed by a presentation of new directions in the area. Finally, future insights are given.

Correspondence to: mohamed.gaber@port.ac.uk
School of Computing, University of Portsmouth, Portsmouth, Hampshire, UK
DOI: /widm.52

NOTABLE TECHNIQUES IN DATA STREAM MINING

This section discusses the four categories of data stream mining techniques listed in the introductory section.

Two-Phase Techniques

The two-phase techniques were introduced by Aggarwal et al. [5] The general idea of this category of techniques is to maintain an online summary of the data using what have been termed microclusters. Microclustering extends the data structure that Zhang et al. [6] proposed in developing balanced iterative reducing and clustering using hierarchies (BIRCH).

Volume 2, January/February

wires.wiley.com/widm

The maintenance of the online microclusters is followed by a second phase that is performed offline. This second phase differs from one technique to another according to whether the ultimate objective is to run a supervised or an unsupervised technique. On the basis of the two-phase strategy, a framework for clustering data streams termed CluStream has been proposed. [5] The technique divides the clustering process into two components: online and offline. The online component stores summary statistics about the data streams, and the offline component performs clustering on the summarized data according to a number of user preferences such as the time frame and the number of clusters.

In an important milestone for the two-phase techniques, Aggarwal et al. [7] proposed an extension to CluStream termed HPStream, a projected clustering technique for high-dimensional data streams. HPStream has outperformed CluStream in a number of case studies. The main motivation behind the development of HPStream is that CluStream did not perform effectively on high-dimensional streaming data.

Aggarwal et al. [8] adopted the idea of microclusters introduced in CluStream [5] for on-demand classification. As described earlier, CluStream divides the clustering process into online and offline components. On-demand classification [8,9] uses clustering results to classify data using statistics of the class distribution. The main motivation behind the technique is that the classification model should be used over a time period chosen according to the application. The technique maintains microclusters for each class in the data stream. This initialization is followed by a nearest neighbor classification of the unlabeled data. The key to the technique is the subtractive property of the microclusters, which enables the extraction of the needed microclusters over the required time period.
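The subtractive property can be illustrated with a minimal sketch. It assumes that a microcluster is reduced to a count, a per-dimension linear sum, and a per-dimension squared sum; CluStream's microclusters also carry timestamp statistics, which are left out here. The class name `MicroCluster` and its methods are illustrative and are not taken from Ref 5.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MicroCluster:
    """Additive summary of a group of points (hypothetical, simplified)."""
    n: int
    linear_sum: List[float]
    squared_sum: List[float]

    @classmethod
    def empty(cls, dim: int) -> "MicroCluster":
        return cls(0, [0.0] * dim, [0.0] * dim)

    def insert(self, point: Sequence[float]) -> None:
        # Online phase: absorb one record by updating the sums in place.
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def snapshot(self) -> "MicroCluster":
        # Copy of the current statistics, to be stored for later subtraction.
        return MicroCluster(self.n, list(self.linear_sum), list(self.squared_sum))

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def __sub__(self, older: "MicroCluster") -> "MicroCluster":
        # Subtractive property: statistics of the points that arrived
        # after the older snapshot was taken.
        return MicroCluster(
            self.n - older.n,
            [a - b for a, b in zip(self.linear_sum, older.linear_sum)],
            [a - b for a, b in zip(self.squared_sum, older.squared_sum)],
        )


mc = MicroCluster.empty(dim=2)
for p in [(1.0, 1.0), (3.0, 3.0)]:
    mc.insert(p)
earlier = mc.snapshot()          # stored at some earlier time
for p in [(10.0, 10.0), (12.0, 14.0)]:
    mc.insert(p)
recent = mc - earlier            # summary of the most recent period only
print(recent.n, recent.centroid())
```

Because every statistic is a plain sum, subtracting an earlier snapshot from the current one yields the summary of exactly the records that arrived in between, without revisiting the stream.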
Hoeffding Bound-Based Techniques

Domingos and Hulten [10,11] proposed a generic strategy for scaling up machine learning algorithms termed very fast machine learning (VFML). This strategy depends on determining an upper bound for the learner's accuracy loss as a function of the number of examples (data records) used in each step of the algorithm. The Hoeffding bound [12] has been the key to the development of the VFML techniques; hence, we refer to this group as Hoeffding bound-based techniques. The bound states that, with probability $1 - \delta$, the true mean $r$ is at least $\bar{r} - \epsilon$, where $\bar{r}$ is the estimated mean and

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}},$$

where $R$ is the range of the estimated quantity and $n$ is the number of points.

This generic method has been applied to an extension of the traditional K-means clustering algorithm, VFKM, and to decision tree classification, very fast decision trees (VFDT). Unlike K-median, which has been used extensively in data stream clustering, the K-means algorithm computes the cluster centers using the mean values of the data records assigned to the cluster under examination. VFKM [13] uses the Hoeffding bound to determine the number of examples needed in each step of the K-means algorithm. VFKM runs as a sequence of K-means executions, with each run using more data records than the previous one, until the calculated Hoeffding bound is satisfied.

Domingos and Hulten [10,11] have developed VFDT, a decision tree learning system based on Hoeffding trees. It splits the tree using the current best attribute once the number of examples (records) used satisfies the Hoeffding bound. VFDT is an extended version of the Hoeffding tree algorithm that addresses the research issues of data streams. These research issues are as follows:
- Ties of attributes: occur when two or more attributes have close values of the splitting criterion, such as information gain.
- High-speed nature of data streams: an inherent feature of data streams.
- Bounded memory: the tree can grow until the algorithm runs out of memory.
- Accuracy of the output: an issue in all data stream mining algorithms.

The extension of Hoeffding trees in VFDT has been done using the following techniques:
- Ties of attributes have been overcome using a user-specified threshold of acceptable error for the output. That way, the running time of the algorithm is reduced and the risk of it running indefinitely is avoided.
- The high-speed nature of the streaming information has been addressed using batch processing. The splitting criteria are computed in batches rather than online with every record. This significantly reduces the time of recalculating the criteria for all

the attributes with each incoming record of the stream.
- Bounded memory has been addressed by deactivating the least promising leaves and ignoring the poor attributes. Poor attributes are identified through the difference between the splitting criteria of the highest- and lowest-ranked attributes. If the difference is greater than a prespecified value, the attribute with the lowest splitting measure is removed, saving memory in the data stream computing environment.
- The accuracy of the output has been addressed by using multiple scans over the data stream in the case of low data rates, and by initializing the tree with a different, more accurate technique.

All of the above improvements have been tested using synthetic data sets. The experiments have demonstrated the efficiency of these improvements.

VFDT has been extended by Hulten et al. [11] to address the problem of concept drift in evolving data streams. The new framework has been termed CVFDT. It essentially runs VFDT over fixed sliding windows in order to keep the classifier up to date. A change is detected when the splitting criteria change significantly across the input attributes. It is worth pointing out that the work by Masud et al. [14] tackles concept drift with emerging classes.

SAX-Based Techniques

Symbolic Aggregate approXimation (SAX) is a time series representation that has been introduced by Keogh and his colleagues. [15] SAX has proved to be the state-of-the-art technique in time series representation. Time series data is a typical streaming source with a temporal dimension. In addition to being used in traditional data mining techniques such as clustering, classification, and indexing, SAX has achieved important breakthroughs in finding the most different subsequence in a time series, termed discord, [16] and the most frequent subsequence in a time series, termed motif.
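Returning briefly to the Hoeffding bound-based techniques: the bound is simple enough to evaluate directly, and a small sketch may help show how it drives a split decision. The function names, the default tie threshold of 0.05, and the way the threshold is applied (split on the current best attribute once the bound itself falls below the threshold) are assumptions made for illustration; they are not taken from the VFML or VFDT implementations.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the bound is no larger than epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


def can_split(best_gain: float, second_gain: float, value_range: float,
              delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Hypothetical split test in the spirit of a Hoeffding tree.

    Split when the observed advantage of the best attribute exceeds epsilon,
    or when epsilon is already below the tie threshold, so that two nearly
    equal attributes cannot delay the decision indefinitely.
    """
    eps = hoeffding_epsilon(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


eps_1000 = hoeffding_epsilon(value_range=1.0, delta=1e-7, n=1000)
n_for_005 = examples_needed(value_range=1.0, delta=1e-7, epsilon=0.05)
print(round(eps_1000, 4), n_for_005)
```

The bound shrinks with the square root of the number of examples, which is why a leaf only needs to wait for more records when the two best attributes are close.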
Motif discovery is described in Ref 17. Numerous applications have used the SAX representation with notable success; some examples follow. It has been reported [16] that a premature ventricular contraction could be accurately identified using discord detection in an electrocardiogram (ECG) time series. Li and Nallela [18] have used motif discovery with the SAX representation to successfully find patterns in water levels.

SAX follows three major steps in converting a time series from its numerical form to its symbolic form. The first step is Piecewise Aggregate Approximation (PAA). This converts a time series $C$ of length $n$ to one of an arbitrary length $w$ using the following equation:

$$\bar{C}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} C_j,$$

where $\bar{C}_i$ is the $i$th time point in the approximated time series. The second step is symbolic discretization. Breakpoints are set so that they produce equal areas under the curve of the Gaussian distribution. Each breakpoint represents a step from one letter to the next when replacing the approximated values produced by PAA with their symbols. The final step uses a distance measure between each pair of characters, stored in a lookup table, to compute the accumulated distance between any two subsequences of the time series.

Granularity-Based Techniques

The granularity-based approach has been introduced by Gaber et al. Having noted that stream mining techniques may fall short when running on resource-constrained devices such as smart phones and sensor nodes, the granularity-based approach adapts the mining techniques so that they change their resource consumption patterns over time according to the availability of resources. Resource consumption patterns represent the change in resource consumption over a period of time, which is termed a time frame.
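The three SAX steps described above (PAA, Gaussian discretization, and a breakpoint-based distance) can be sketched as follows. This is an illustrative sketch and not the reference implementation: it assumes that the series is z-normalized before PAA, that $n$ is divisible by $w$, and that the distance between two symbols is zero for adjacent letters and the gap between the enclosing breakpoints otherwise, scaled by $\sqrt{n/w}$. The function names are hypothetical.

```python
import bisect
import math
from statistics import NormalDist, mean, pstdev
from typing import List, Sequence


def paa(series: Sequence[float], w: int) -> List[float]:
    """Step 1: mean of each of w equal-length segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes n is divisible by w")
    seg = n // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]


def breakpoints(alphabet_size: int) -> List[float]:
    """Cut points giving equal areas under the standard Gaussian curve."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax_word(series: Sequence[float], w: int, alphabet_size: int) -> str:
    """Steps 1 and 2: z-normalize (an assumption), reduce, map to letters."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("constant series cannot be z-normalized")
    z = [(x - mu) / sigma for x in series]
    cuts = breakpoints(alphabet_size)
    return "".join(chr(ord("a") + bisect.bisect_right(cuts, v)) for v in paa(z, w))


def mindist(word_a: str, word_b: str, n: int, alphabet_size: int) -> float:
    """Step 3: accumulated symbol distance between two equal-length words."""
    cuts = breakpoints(alphabet_size)
    total = 0.0
    for ca, cb in zip(word_a, word_b):
        lo, hi = sorted((ord(ca) - ord("a"), ord(cb) - ord("a")))
        if hi - lo > 1:  # identical or adjacent letters contribute zero
            total += (cuts[hi - 1] - cuts[lo]) ** 2
    return math.sqrt(n / len(word_a)) * math.sqrt(total)


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
word = sax_word(series, w=4, alphabet_size=4)
print(paa(series, 4), word, mindist("aaaa", "dddd", n=8, alphabet_size=4))
```

In practice the per-symbol distances would be precomputed once into the lookup table mentioned in the text; they are recomputed here only to keep the sketch short.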
The algorithm granularity settings are the input, output, and processing settings of a mining algorithm that can vary over time to cope with the availability of resources and the current data stream arrival rate. These settings are defined as follows:
- Algorithm input granularity (AIG): the process of changing the data stream arrival rate that feeds the algorithm. Examples of techniques that could be used include sampling, load shedding, and creating data synopses. Sampling has been the technique of choice in developing the granularity-based data mining techniques.
- Algorithm output granularity (AOG): the process of changing the output size of the algorithm in order to preserve the limited

memory space. We refer to this output as the number of knowledge structures, for example, the number of clusters or rules.
- Algorithm processing granularity (APG): the process of changing the algorithm parameters in order to consume less processing power. Randomization and approximation techniques represent the strategies of APG.

It should be noted that there is a collective interaction among the above three settings. AIG mainly affects the data rate and is associated with bandwidth and battery consumption. AOG, on the other hand, is associated with memory, and APG with processing power. However, a change in any of them affects the other resources. The process of enabling resource awareness should be very lightweight in order to be feasible in a streaming environment characterized by scarce resources. Accordingly, the algorithm granularity settings consider only the direct interactions.

Algorithm granularity requires continuous monitoring of the computational resources. This is done over fixed time intervals (frames) that we denote as TF. According to this periodic resource monitoring, the mining algorithm changes its parameters to cope with the current consumption patterns of resources. These parameters are the AIG, APG, and AOG settings discussed above. It has to be noted that the value of TF is critical to the success of the running technique. The higher TF is, the lower the adaptation overhead will be, but at the risk of a high consumption of resources during the long time frame.

The use of algorithm granularity as a general approach for mining data streams requires some formal definitions and notation.
The following definitions are used in our discussion:
- $R$: set of computational resources, $R = \{r_1, r_2, \ldots, r_n\}$;
- $TF$: time interval for resource monitoring and adaptation;
- $ALT$: application lifetime;
- $ALT'$: time left until the end of the application lifetime;
- $NoF(r_i)$: number of time frames needed to consume the resource $r_i$, assuming that the consumption of $r_i$ follows the same pattern as in the last time frame;
- $AGP(r_i)$: algorithm granularity parameter that affects the resource $r_i$.

According to the above, the main rule of the algorithm granularity approach is as follows:

$$\text{IF } \frac{ALT'}{TF} > NoF(r_i) \text{ THEN SET } AGP(r_i)^{-} \text{ ELSE SET } AGP(r_i)^{+}$$

where $AGP(r_i)^{+}$ achieves higher accuracy at the expense of higher consumption of the resource $r_i$, and $AGP(r_i)^{-}$ achieves lower accuracy with the advantage of lower consumption of the resource $r_i$. This simplified rule could take different forms according to the monitored resource and the algorithm granularity parameter applied to control the consumption of this resource. Interested readers are referred to Ref 19 for an application of the above rule in controlling a data stream clustering algorithm termed RA-Cluster.

Interested practitioners can use the following procedure to enable resource awareness and adaptation in their data stream mining algorithms:

1. Identify the set of resources $R$ to which the mining algorithm will adapt;
2. Set the application lifetime ($ALT$) and the time interval/frame ($TF$);
3. Define $AGP(r_i)^{+}$ and $AGP(r_i)^{-}$ for every $r_i \in R$;
4. Run the algorithm for $TF$;
5. Monitor the resource consumption for every $r_i \in R$;
6. Apply $AGP(r_i)^{+}$ or $AGP(r_i)^{-}$ to every $r_i \in R$ according to the ratio $\frac{ALT'}{TF} : NoF(r_i)$ and the rule given;
7. Repeat the last three steps.

Applying the above procedure is all that is needed to enable resource awareness and adaptation, using the algorithm granularity approach, in stream mining algorithms.
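The rule and the monitoring step of the procedure above can be sketched in a few lines. The sketch assumes that each resource is described by the amount still available and the amount consumed during the last time frame, so that $NoF(r_i)$ is simply their ratio; the names `ResourceReading`, `frames_until_exhausted`, and `choose_setting` are hypothetical, and the returned strings stand in for whatever concrete parameter change a given algorithm defines.

```python
import math
from dataclasses import dataclass


@dataclass
class ResourceReading:
    remaining: float        # amount of resource r_i still available
    used_last_frame: float  # amount of r_i consumed during the last TF


def frames_until_exhausted(reading: ResourceReading) -> float:
    """NoF(r_i), assuming consumption repeats the pattern of the last frame."""
    if reading.used_last_frame <= 0:
        return math.inf
    return reading.remaining / reading.used_last_frame


def choose_setting(alt_remaining: float, tf: float, reading: ResourceReading) -> str:
    """Apply the rule: compare the frames left to run with NoF(r_i)."""
    frames_left = alt_remaining / tf
    if frames_left > frames_until_exhausted(reading):
        return "AGP-"  # resource would run out first: trade accuracy for savings
    return "AGP+"      # resource outlasts the application: favor accuracy


# Ten frames of 60 s remain. The battery lasts 5 more frames, memory lasts 50.
battery = ResourceReading(remaining=50.0, used_last_frame=10.0)
memory = ResourceReading(remaining=500.0, used_last_frame=10.0)
decisions = {
    "battery": choose_setting(600.0, 60.0, battery),
    "memory": choose_setting(600.0, 60.0, memory),
}
print(decisions)
```

A full controller would call `choose_setting` once per resource at the end of every time frame and map each decision onto the AIG, AOG, or APG setting associated with that resource.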
On the basis of the granularity-based approach, a number of data stream mining algorithms have been developed. For a complete list of techniques, the reader is advised to review the recent tutorial by Gama et al.

NEW DIRECTIONS IN DATA STREAM MINING

Data stream mining has evolved as a new form of online data analysis that has also challenged the computational capabilities of our state-of-the-art data processing facilities. However, advances in the computational power of small devices, including personal digital assistants (PDAs), smart phones, and sensor nodes, have created an unprecedented opportunity to perform ubiquitous data stream mining. We can broadly divide this area into mining sensor data streams and mobile data mining. Recent achievements in these areas are discussed in the following subsections.

Mining Sensor Data Streams

Many important applications, coupled with the increase in the computational power of wirelessly connected sensor nodes, have given birth to this new research direction in the data stream mining area. Mining data streams originating from sensor nodes has witnessed notable success in the last few years. Research issues associated with this area have been detailed in Ref 3. The differences between data stream mining in sensor networks and on other platforms, as detailed in Ref 3, are as follows:
- Duplication of data in dense deployments of wireless sensor networks introduces a new challenge.
- Multilevel data mining is important in wireless sensor networks, given that individual sensors can generate local models that need to be integrated.
- Real-time data cleansing is needed, given that sensory streaming data is likely to be noisy.
- Adaptation to the availability of resources is inevitable, given the limited resources of each sensor node.

It is worth mentioning the success of the granularity-based approach in developing stream mining techniques that are able to operate in wireless sensor networks. [20,23] The field is concerned with benefiting from the large deployment of small computational devices that are able to communicate wirelessly and have increasing sensing capabilities.
This rich source of streaming data is key to the success of many important security, scientific, and industrial applications. Examples of these applications can be found in Refs 3,4,24.

Mobile Data Stream Mining

The number of mobile users is continuously increasing, and mobile data mining users are no exception. Academic prototypes such as Open Mobile Miner (OMM) [25] and commercial products such as MineFleet [26] have already found their way to users. The early start of the area of mobile data mining can be dated back to the MobiMine system developed by Kargupta et al. [27] Although the system targets mobile brokers in the stock exchange area, the data mining process was performed on a server in order to conserve the scarce resources of the mobile device, a PDA in this case. A few years later, Kargupta et al. [28] developed the VEDAS system for distributed data stream mining over a fleet of vehicles, analyzing both the driver's behavior and the vehicle's health. The system used mobile devices running different data stream mining techniques. This was a result of the advances in the computational capabilities of mobile devices. Mobility of the user, connectivity problems, and availability of computational resources are the major research issues in this promising area of research. The granularity-based approach has proved to be a successful solution when running stream mining techniques on mobile devices with limited resources. [21] The OMM tool [25] has adopted the granularity-based approach.
Future Insights

We can state the following future directions and insights in this growing area of research:
- Online medical, scientific, and biological data stream mining using data generated from medical and biological instruments and the various tools employed in scientific laboratories;
- Hardware solutions for small devices emitting or receiving data streams, in order to enable high-performance computation on small devices;
- Developing software architectures that serve data streaming applications;
- Situation-aware data stream mining that recalls the models built in similar situations rather than building a new model;
- Online text mining for opinion discovery with the notable use of Web 2.0 technologies.

Conclusion

This review paper has highlighted the major strategies and techniques used in data stream mining. We have identified four categories of techniques: (1) two-phase

techniques, (2) Hoeffding bound-based techniques, (3) symbolic approximation-based techniques, and (4) granularity-based techniques. Details of each category have been discussed. New directions and future insights in this growing area of research have been presented. Two research directions have been discussed. The first concerns mining data originated from sensor networks. Mobile data stream mining represents the second area. Finally, future insights by the author have been enumerated, giving the reader some potential direction for research.

REFERENCES

1. Gaber MM, Zaslavsky A, Krishnaswamy S. Mining data streams: a review. ACM SIGMOD Rec 2005, 34.
2. Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In: Proceedings of PODS.
3. Gama J, Gaber MM, eds. Learning from Data Streams: Processing Techniques in Sensor Networks. Springer Verlag.
4. Ganguly A, Gama J, Omitaomu O, Gaber MM, Vatsavai RR, eds. Knowledge Discovery from Sensor Data. Berlin, Germany: CRC Press.
5. Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: Proceedings of the 29th VLDB Conference. Berlin; 2003.
6. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. SIGMOD Rec. New York: ACM Press; 1996, 25.
7. Aggarwal CC, Han J, Wang J, Yu P. A framework for high dimensional projected clustering of data streams. In: Proceedings of the VLDB Conference.
8. Aggarwal CC, Han J, Wang J, Yu P. On demand classification of data streams. In: Proceedings of the ACM KDD Conference. Seattle, WA; 2004.
9. Gaber MM, Zaslavsky A, Krishnaswamy S. A survey of classification methods in data streams. In: Aggarwal C, ed. Data Streams: Models and Algorithms. Springer Verlag; 2007.
10. Domingos P, Hulten G. Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press; 2000.
11. Hulten G, Spencer L, Domingos P. Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press; 2001.
12. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc 1963, 58.
13. Domingos P, Hulten G. A general method for scaling up machine learning algorithms and its application to clustering. In: Proceedings of the 18th International Conference on Machine Learning. Williams College, Williamstown, MA, USA; 2001.
14. Masud MM, Gao J, Khan L, Han J, Thuraisingham BM. Integrating novel class detection with classification for concept-drifting data streams. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Bled, Slovenia; 2009.
15. Lin J, Keogh E, Lonardi S, Chiu B. A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. San Diego, CA; 2003.
16. Keogh E, Lin J, Fu A. HOT SAX: efficiently finding the most unusual time series subsequence. In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005). Houston, TX; 2005.
17. Chiu B, Keogh E, Lonardi S. Probabilistic discovery of time series motifs. In: Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington D.C.; 2003.
18. Li L, Nallela S. Probabilistic discovery of motifs in water level. In: IEEE International Conference on Information Reuse and Integration. Las Vegas, NV; 2009.
19. Gaber MM, Yu PS. A holistic approach for resource-aware adaptive data stream mining. J New Gen Comput 2006, 25.
20. Phung ND, Gaber MM, Röhm U. Resource-aware online data mining in wireless sensor networks. In: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining. IEEE Symposium Series on Computational Intelligence. Honolulu, HI; 2007.
21. Gaber MM. Data stream mining using granularity-based approach. In: Abraham A, Hassanien A, Carvalho A, Snase V, eds. Foundations of Computational Intelligence. Vol. 6. Berlin/Heidelberg: Springer; 2009.
22. Gama J, Gaber MM, Krishnaswamy S. Data stream mining: from theory to applications and from stationary to mobile. In: Twenty-Fifth Symposium On Applied Computing. Sierre, Switzerland. Available at: SAC10-DS-Tutorial/Tutorial-SAC10-Final.pdf (Accessed October 20, 2011.)
23. Gaber MM, Shiddiqi AM. Distributed data stream classification for wireless sensor networks. In: Proceedings of the 2010 ACM Symposium on Applied Computing (SAC). Sierre, Switzerland: ACM Press; 2010.
24. Gaber MM, Vatsavai R, Omitaomu O, Gama J, Chawla N, Ganguly A, eds. Knowledge Discovery from Sensor Data. Lecture Notes in Computer Science. Las Vegas, NV. Berlin, Germany: Springer.
25. Krishnaswamy S, Gaber MM, Harbach M, Hugues C, Sinha A, Gillick B, Haghighi PD, Zaslavsky A. Open Mobile Miner: a toolkit for mobile data stream mining. ACM Knowl Discov Databases.
26. Agnik. MineFleet description. (Accessed October 17, 2011.)
27. Kargupta H, Park B, Pittie S, Liu L, Kushraj D, Sarkar K. MobiMine: monitoring the stock market from a PDA. ACM SIGKDD Explor 2002, 3.
28. Kargupta H, Bhargava R, Liu K, Powers M, Blair P, Bushra S, Dull J, Sarkar K, Klein M, Vasa M, Handy D. VEDAS: a mobile and distributed data stream mining system for real-time vehicle monitoring. In: Proceedings of the SIAM International Data Mining Conference. Orlando, FL; 2004.
