The Impacts of Data Stream Mining on Real-Time Business Intelligence


Yang Hang, Simon Fong
Faculty of Science and Technology, University of Macau, Macau SAR
henry.yh.gmail.com; ccfong@umac.

Abstract

Real-time Business Intelligence (rt-BI) is an emerging field for business executives who need to make effective decisions in a very short time. Such immediate decisions are not necessarily based on historical data; instead they are derived from the most recent data, usually obtained just minutes or seconds ago. A number of recent technologies are promising for rt-BI, such as real-time data warehouses, Complex Event Processing, real-time ETL, data stream management systems, Stream Query Processing, and several rt-BI architectures available from both academic research and commercial products. One core component in the data analytic layer of a typical rt-BI architecture is the data mining algorithm. Although data stream mining has been studied extensively at the algorithmic level during the last decade, it has not been evaluated in relation to rt-BI. In this paper we conduct simulation experiments comparing a traditional data mining algorithm with a data stream mining algorithm with respect to their performance and applicability in rt-BI. Both synthetic and live data of up to 10^6 instances are used in the tests. The results should be a useful reference for information technologists who want to implement rt-BI applications with an appropriate choice of mining algorithm.

Keywords: data stream mining, real-time business intelligence, performance evaluation, JAVA, WEKA, MOA

I. INTRODUCTION

For many years business intelligence (BI) has been used by organizations to gain insight into their business operations and thereafter improve them. BI has been formulated into strategic and tactical business plans and initiatives, usually by analyzing historical business data. A new breed of BI, namely real-time BI (rt-BI), is emerging for managing, monitoring and optimizing daily business operations in real time or near real time. Rt-BI has been claimed to be the next-generation BI, as it is empowered by advanced predictive analytics over continuous data streams, real-time monitoring, and the speed of in-memory technology [1].

Business executives may devise strategic decisions and plans only annually or quarterly, reasoning from reports that are statistically generated from historical data. Traditional BI fulfills this need. However, every now and then executives are faced with dozens of small tactical decisions, and often they have to decide on the spot under a tight deadline. For instance, front-line operators and managers increasingly need to know what is happening in a dynamic market or business right now, as in this second, not yesterday or even half an hour ago, in order to make an instant decision. Typical examples range from deciding whether an arriving transaction among many is fraudulent, or whether a further inch of price could be bargained in a negotiation, to determining the best deal to offer an online customer so that the minimum profit margin is sustained given the current stock level and market demand. Rt-BI is designed to support such time-critical decisions.

Figure 1. The value of data to the two major categories of decision making [2].

Based on [2, 3, 6], which advocate that the value of data to decision making declines gradually as time goes by, Figure 1 shows a curve that extends across several types of BI as latency increases. Essentially, the types of BI under the curve map to two major categories of decision making: time-critical decisions based on fresh data, and traditional business intelligence that relies on stored historical data. The main differences between these two categories are how their outputs are used (as statistical reports or as actionable information) and the timeliness of the data from which they are generated. Rt-BI usually refers to time-critical decisions. The information from rt-BI is made available at ultra-low latency, from the very latest data.

II. DATA STREAM MINING

An important part of the rt-BI architecture, as in Figure 2, is the automated decision-making component, which is usually powered by a decision tree. The decision tree (DT) is one of the most important techniques for classification and prediction in data mining. Its advantage is that tree models have a high degree of interpretability; rules can be readily extracted from a DT and built into an automated decision maker. In this paper we use the term traditional decision tree (TDT) for models built with classical algorithms such as ID3 [12], C4.5 [13], and CART [14]. It was argued in [15] that TDT models are not suitable for data streams with highly fluctuating data rates in real-time applications. Therefore a new breed of data mining algorithms, under the domain of stream mining, has been proposed in recent years to tackle real-time and fast data streaming requirements.

Whereas TDT models assume that a learning model is built upon static and well-structured data, stream mining, in particular the Hoeffding Tree Algorithm (HTA), dynamically constructs a decision tree along the moving data stream. HTA is chosen to represent the DT in data stream mining because of its popularity. C4.5 is a well-known TDT algorithm that generates a decision tree using information gain [13]. The decision trees generated by C4.5 can be used for classification, which is often embedded in decision support systems. The TDT classifier runs in two separate steps, train then test. Consequently, the rules in the DT model are refreshed only when the DT is rebuilt with the whole updated dataset. In contrast, HTA progressively updates the rules in real time, as the DT adjusts itself when fresh data stream in. Readers who want more details about the operation of HTA can refer to [16].
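HTA's ability to adjust the tree as data stream in rests on deciding, at each leaf, when enough examples have been seen to commit to a split. The following is a minimal sketch of the Hoeffding bound commonly used for that decision in Hoeffding trees [18], assuming information gain as the split measure. The function names and the parameter values are illustrative assumptions and are not taken from MOA or from the experiments in this paper.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range is within epsilon of the mean observed over
    n independent examples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 n_classes: int, delta: float, n: int) -> bool:
    """Decide whether a leaf has seen enough examples to split on the best
    attribute (hypothetical helper, not a MOA API)."""
    # The range of information gain is log2(number of classes).
    epsilon = hoeffding_bound(math.log2(n_classes), delta, n)
    return (best_gain - second_gain) > epsilon


if __name__ == "__main__":
    # With few examples the bound is wide, so the leaf keeps waiting.
    print(should_split(0.30, 0.25, n_classes=2, delta=1e-7, n=200))
    # With many examples the bound shrinks and the same gap justifies a split.
    print(should_split(0.30, 0.25, n_classes=2, delta=1e-7, n=10_000))
```

Because the bound shrinks with the square root of the number of examples, a leaf can commit to a split after seeing only a sample of the stream, which is what allows the tree to be built in one pass.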
As highlighted in [17], it may be sufficient to use just a small sample of the available data for choosing the split attribute at any given node of a decision tree. The statistical result behind this is known as the Hoeffding bound (or additive Chernoff bound), which is used to solve the difficult problem of deciding exactly how many samples are necessary at each node [5, 7, 9, 10, 11, 18]. Researchers in the literature cited above attempted to devise new algorithms that restrict the tree size to the small memory available. HTAs embedded in an rt-BI process should meet the following real-time constraints: stationary and non-stationary data input; limited memory space; very fast response; and high applicability to incomplete data.

With respect to real-time requirements, data stream mining has characteristics that are preferable to those of traditional data mining [8]. In terms of DT model construction time, TDT requires multiple scans of the whole database at intervals. As the database grows with new data continually arriving, the time needed to read through the database escalates proportionally and will eventually become prohibitive, especially in dynamic business environments where many sources of data stream in at high speed.

The study in [15] is an excellent review that covers most of the features of data stream mining algorithms. The following diagrams illustrate the step flows of the TDT and of the HTA in stream mining.

Figure 2. Mining step flows comparison

Figure 3. Example of DT induced with data streams

The operation of a TDT such as the classical C4.5 is briefly depicted as follows (though not to scale):

Figure 4. Timeline of interleaving activities in C4.5 operation

At the beginning there is an overhead of model construction in C4.5 (scans over the whole dataset) plus the time for model validation. The model is then ready for use.
Over time, the accuracy declines as new data arrive, because the model built upon old data falls short of catching up with new trends in the data. The model then needs to be refreshed (or updated) with the inclusion of new data. The usage periods and the refresh periods interleave. As time goes on and the whole data volume grows, the refresh time stretches longer and longer, such that $T_n > T_{n-1} > \dots > T_2 > T_1$. Therefore the total running time for C4.5 will be

$T_{total} = T_0 + T_1 + T_2 + T_3 + \dots$

where $T_0$ is the overhead for the initial model building. That is,

$T_{total} = T_0 + \sum_{i} (t_i + a_i)$

where $i$ runs over the refreshes the model needs and $a_i$ is the additional amount of time that the $i$-th refresh takes. $a_i$ grows exponentially in this manner. Consequently, the refresh time grows longer in each successive step because the total data volume gets larger as new incoming data are added. Eventually C4.5 will become unusable when the data size hits a limit. Is stream mining then a remedy, given that it was designed to handle data streams and is suitable for very low latency rt-BI operations? How does its accuracy compare with that of traditional data mining methods such as C4.5? A series of experiments was conducted to answer these questions.

III. EXPERIMENTS

A simulation system was programmed in Java to demonstrate the differences between TDT and HTA. The representative algorithm for TDT is J48, the C4.5 implementation whose source code is provided by WEKA. The implementation of the Hoeffding Tree algorithm is based on source code taken from Massive Online Analysis (MOA). Both WEKA and MOA are experimental packages developed by researchers at the University of Waikato, New Zealand, who are pioneers as well as authoritative providers of open-source data mining software. The experiment platform is a PC with a 2.99 GHz CPU and 1 GB RAM.

In MOA, the data arrive continually and the total streaming data are large. In this case, one important factor that influences the accuracy of HTA is data quality, which is controlled by the proportion of useful data to noise. In experiments A, B and C, we use three processed LED data streams of up to one million instances per stream. The LED stream datasets, widely tested by other researchers, represent a classical problem of predicting the digit displayed on a 7-segment LED display where each attribute has a 1% chance of being inverted. The synthesized datasets carry 24 binary attributes; only 7 attributes are relevant, and the rest can be configured as noise in the experiments. The datasets are mixed with 0%, 1% and 2% of noise data respectively. In the last experiment, D, we define a term called "usefulness", which measures the effective accuracy of a DT model in use with new incoming data while the current model becomes outdated until the next refresh.

A. Tree size comparison

This experiment tests the resultant decision tree sizes obtained by both the C4.5 and HTA algorithms. The same LED datasets were used for both algorithms. Up to one million records were used because we wanted to observe whether the DT sizes ever grow beyond the memory limits. Various percentages of noise were injected into the data because noise is known to have adverse effects on the sizes of DTs. The experimental results show the C4.5 tree sizes (numbers of tree nodes) obtained from running over data of different noise levels. When the dataset is free from noise, the tree size is constant at 19 nodes regardless of the number of instances. It is well known that C4.5 is sensitive to noisy data. The tree sizes increase almost linearly with the number of noise-infested instances. A similar phenomenon is observed for HTA: tree size climbs as the volume of noisy data grows. One interesting observation, however, is that HTA shows an almost identical rate of increase in tree size for noise levels 1% and 2%. Overall, the ratio of additional tree nodes to number of 1 instances, in the case of 2% noise, is 1: / 1 5 for HTA (for every 1 5 instances, there will be increase of nodes), and the ratio for C4.5 is 1: / 1 4 (for every 1 4 instances, there will be increase of nodes). These rates of increase confirm that C4.5 trees grow many times faster than HTA trees. At the point when the training data reach 7,, instances, the C4.5 tree size exceeds 14, nodes (in the case of 2% noise). For HTA, the tree size is still kept below 1 in the same situation.

B. Running time comparison

The computation time spent in a data mining process is one of the most important factors influencing real-time efficiency. It has a direct impact on the real-time constraints and latency requirements [4].
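To make the scaling concern concrete, the sketch below contrasts a batch learner that rescans all accumulated data at every refresh, as described for C4.5 in Section II, with a one-pass learner that reads each instance once. It assumes, purely for illustration, that the cost of a rebuild is proportional to the number of instances scanned; this is a simplification, and the function names and data sizes are hypothetical rather than taken from the experiments.

```python
def batch_refresh_cost(initial: int, batch: int, refreshes: int,
                       cost_per_instance: float = 1.0) -> float:
    """Total cost when every refresh rescans all data accumulated so far.

    Assumption: rebuild cost is linear in the number of instances scanned.
    """
    total = initial * cost_per_instance  # initial model building
    size = initial
    for _ in range(refreshes):
        size += batch                      # new data arrive before the refresh
        total += size * cost_per_instance  # full rescan of the grown dataset
    return total


def one_pass_cost(initial: int, batch: int, refreshes: int,
                  cost_per_instance: float = 1.0) -> float:
    """Total cost when each instance is read exactly once."""
    return (initial + batch * refreshes) * cost_per_instance


if __name__ == "__main__":
    # Illustrative sizes only: 1,000 initial instances, 1,000 per refresh.
    print(batch_refresh_cost(1000, 1000, 3))  # 10000.0
    print(one_pass_cost(1000, 1000, 3))       # 4000.0
```

Even under this optimistic linear assumption, the batch total grows quadratically with the number of refreshes, while the one-pass total grows linearly.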
Although it is obvious that the time taken for data mining depends on the total data size, we are concerned with how the time requirements scale as the number of data records rises. Figures 5 and 6 show that C4.5 consumes much more time than HTA as the data size grows. For example, if the timeout required by the rt-BI is arbitrarily set at 6 seconds (a result must be returned before the timeout, or else the information is deemed worthless), then on the same data with 2% noise, C4.5 can process only about 5, instances (see Figure 5), while HTA can process as many as 35, (see Figure 6).

Figure 5. C4.5 and HTA computational time (series: C4.5 and HTA at 0%, 1% and 2% noise; y-axis: computation time in seconds)

Figure 6. HTA computational time for large data sizes (series: 0%, 1% and 2% noise)

C. Algorithm accuracy comparison

In this experiment, we observe clearly that the accuracy of C4.5 is better than that of HTA. C4.5 can achieve a perfect score by embracing the full noise-free dataset in building its DT. HTA reaches about 73% accuracy with its one-pass test-and-train model-building mechanism. The lower accuracy of HTA is also attributed to the Hoeffding bound approximation.
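The one-pass test-and-train mechanism mentioned above is often called test-then-train or prequential evaluation: each arriving instance is first used to test the current model and only then to update it. A minimal sketch follows, assuming a toy majority-class learner in place of a Hoeffding tree; the class, the function and the sample stream are hypothetical and are not MOA's API.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner that predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test on each instance first, then train on it, in a single pass."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test with the model as it stands
            correct += 1
        learner.learn(x, y)          # then update the model
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    stream = [((0,), "a"), ((1,), "a"), ((0,), "b"), ((1,), "a")]
    print(prequential_accuracy(stream, MajorityClassLearner()))  # 0.5
```

Because early predictions are made by a barely trained model, accuracy measured this way starts low and accumulates as more instances arrive.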

Although C4.5 performs better than HTA on a relatively small dataset, what happens if the data size grows much bigger, especially in a stream mining scenario where the data streams potentially amount to infinity? In a subsequent experiment in which C4.5 and HTA were tested on mining huge numbers of data records, C4.5 failed with an out-of-memory exception, while HTA continued operating normally under the same OS and hardware configuration. This again confirms that C4.5 is not suited to mining very large datasets, whereas HTA is. With very large experimental data sizes, from 1, to 1,, instances, the results indicate that the accuracy of HTA is cumulative: it increases as more and more instances are put into the calculation. However, the same experiment platform cannot run C4.5, because the number of instances is too large to build a decision tree in the given memory.

D. Model usefulness comparison

In this set of experiments we attempt to illustrate the usefulness of C4.5 and HTA in the light of realistic rt-BI operations. In a real-life environment where predictive analytics is used as one component of a BI system, the operation usually proceeds by first building a decision-making model (i.e., a DT) and then putting it to use on the incoming data (which is similar to testing for accuracy in our experiment). Decisions are made in real time by the model, and they are supposed to be good until some time later, when the model needs to be updated (rebuilt) with the inclusion of new data. This process was already explained in Figure 4, and this experiment sets out to verify the usefulness of the data mining algorithms under such a working sequence.
This differs from the previous experiments, in which the instances were input to the data mining programs all at once (at each testing point along the x-axis) and the division of the data for model training, cross-validation and testing was done automatically by the programs according to their default settings. Real data are also used in this experiment: a large dataset of financial transactions involving loans, credit cards, clients' demographic data, etc., collected from a Discovery Challenge hosted at the PKDD 99 conference. The data have more than 6, instances and over 5 attributes. In this experiment, C4.5 is set up so that the DT model is updated at recurring intervals, each time 15, instances have arrived. As a result, there are four periods in which a model update takes place. In each period, the first 3, instances are collected for rule building, while the other 12, are used for prediction by the just-updated decision-making model. The simulated result for C4.5 is shown in Figure 7. Clearly, the rules established from aged data fall short in accuracy when making predictions on newly arriving data. This is reflected by similar declining trends over the four periods.

Figure 7. The usefulness of C4.5 in real datasets (x-axis: update periods 1 to 4; y-axis: accuracy)

For comparison, we applied the same dataset to HTA in another experiment. The result in Figure 8 shows that the performance curve is rather steady (in contrast to the broken, declining lines of C4.5) and that the general accuracy keeps improving as the DT is updated by HTA's mining mechanism each time new data feed in. However, even as the dataset size becomes very large, the accuracy of HTA seems to be bounded at 8% maximum. This once again validates that the accuracy of HTA is lower than that of C4.5 by a margin, even in the long run.
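The periodic-update protocol used for C4.5 in this experiment can be sketched as follows. The sketch assumes that the model is rebuilt from scratch on the training slice of each period, which the text does not state explicitly, and it uses a toy majority-class model in place of C4.5. All names, the period length and the training size in the example are hypothetical.

```python
from collections import Counter


class MajorityClassModel:
    """Toy stand-in for a batch-trained decision tree."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, x, y):
        self.counts[y] += 1

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


def periodic_usefulness(stream, period, train_size, make_model):
    """Accuracy per update period: train on the first `train_size` instances
    of each period, then predict the remaining instances of that period."""
    if not 0 < train_size < period:
        raise ValueError("train_size must be between 1 and period - 1")
    accuracies = []
    # Only complete periods are evaluated; a trailing partial period is skipped.
    for start in range(0, len(stream) - period + 1, period):
        chunk = stream[start:start + period]
        train, use = chunk[:train_size], chunk[train_size:]
        model = make_model()  # assumption: rebuild from scratch each period
        for x, y in train:
            model.learn(x, y)
        hits = sum(1 for x, y in use if model.predict(x) == y)
        accuracies.append(hits / len(use))
    return accuracies


if __name__ == "__main__":
    labels = ["a", "a", "a", "b", "b", "b", "a", "b"]
    stream = [((i,), label) for i, label in enumerate(labels)]
    print(periodic_usefulness(stream, period=4, train_size=1,
                              make_model=MajorityClassModel))
```

Tracking accuracy inside each period, rather than one number per period, would reproduce the kind of within-period decline reported for C4.5.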
Figure 8. The usefulness of HTA in real datasets (x-axis: update periods 1 to 4; y-axis: accuracy)

IV. CONCLUSION

Business intelligence can be classified into three main types: strategic, tactical, and operational [16]. The first two deal with managing (long-term) business plans and goals based on historical data, while the last focuses on managing and optimizing daily business operations. In operational BI, low-latency processing of business events as they happen is a critical need. To support this kind of real-time BI, the underlying analytic mechanism must handle data streams that potentially amount to infinity, and be able to produce a decision very quickly with reasonable accuracy. In this paper we built a Java simulator by importing and modifying two popular open-source packages, WEKA and MOA, to evaluate the two algorithms C4.5 and HTA, which represent traditional data mining and data stream mining respectively. Interesting properties were observed in the experiments, summarized as follows:

HTA is able to achieve accuracy similar to C4.5's in a small fraction of the time;
C4.5 can achieve a higher accuracy than HTA;
The accuracy of HTA is cumulative, improving as more data arrive;
C4.5's memory requirements and batch nature will not allow it to cope with large data streams;
When the datasets are infested with noise, both algorithms suffer, but C4.5 soon runs into memory explosion with a fast-growing tree in the event of noisy data.

Based on the above points, HTA, which represents data stream mining, is a more suitable algorithm than C4.5 for rt-BI: it meets the requirements of rt-BI such as minimal use of memory space, fast processing time, one pass over a very large amount of streaming data, and reasonable accuracy. This paper contributes an empirical study substantiating the suitability of data stream mining (instead of traditional data mining) for rt-BI.

ACKNOWLEDGMENT

The authors are grateful that this research project, titled Real-time Data Stream Mining, is supported by the Research Committee, University of Macau. Grant number: RG7/9-1S/FCC/FST.

REFERENCES

[1] Doug Henschen, "Next-Gen BI is Here", InformationWeekanalytics.com, White Report, Sept. 18, 2009.
[2] Michael J. Franklin, "Continuous analytics: data stream query processing in practice", Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems, DEBS 2010, Cambridge, United Kingdom, July 12-15, 2010, pp. 1.
[3] Judith R. Davis, "Right-Time Business Intelligence: Optimizing the Business Decision Cycle", B-EYE-Network.com, White Report, Jan. 2006.
[4] Yang Hang, Simon Fong, "Evaluating Hoeffding Tree Algorithm in Real-time Web Applications Environment", The 2nd International Conference on IT and Business Intelligence (ITBI-1), November 2010, Nagpur, India, accepted for publication.
[5] Nishimura, S., Terabe, M., Hashimoto, K., and Mihara, K., "Learning Higher Accuracy Decision Trees from Concept Drifting Data Streams", In Proceedings of the 21st International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Springer-Verlag, Heidelberg, 2008, pp.
[6] Zeljko Panian, "Just-in-Time Business Intelligence and Real-Time Decisioning", Proceedings of the 9th WSEAS International Conference on Applied Informatics and Communications, Moscow, Russia, 2009, pp.
[7] Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby, "New Options for Hoeffding Trees", Advances in Artificial Intelligence, Springer, 2007, pp.
[8] Yang Hang, Simon Fong, "Real-time Business Intelligence System Architecture with Stream Mining", The 5th International Conference on Digital Information Management (ICDIM 2010), July 2010, Thunder Bay, Canada, accepted for publication.
[9] Tao Wang, Zhoujun Li, Xiaohua Hu, Yuejin Yan, and Huowang Chen, "A New Decision Tree Classification Method for Mining High-Speed Data Streams Based on Threaded Binary Search Trees", Emerging Technologies in Knowledge Discovery and Data Mining, Springer, 2009, pp.
[10] Gama, J., Medas, P., and Rodrigues, P., "Learning decision trees from dynamic data streams", In Proceedings of the 2005 ACM Symposium on Applied Computing, ACM, New York, 2005, pp.
[11] Hulten, G., Spencer, L., and Domingos, P., "Mining time-changing data streams", In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, 2001, pp.
[12] Quinlan, J.R., "Induction of decision trees", Machine Learning, 1, 1986, pp.
[13] Quinlan, J.R., "C4.5: Programs for machine learning", Morgan Kaufmann series in machine learning, Kluwer Academic Publishers, 1993.
[14] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., "Classification and regression trees", California, USA, Wadsworth, 1984.
[15] Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S., "Mining data streams: a review", SIGMOD Rec. 34, 2, Jun. 2005, pp.
[16] Albert Bifet, Geoff Holmes, Richard Kirkby, Bernhard Pfahringer, "MOA: Massive Online Analysis", Journal of Machine Learning Research, MIT, Volume 11, May 2010, pp.
[17] Maron, O., and Moore, A.W., "Hoeffding races: Accelerating Model Selection Search for Classification and Function Approximation", NIPS, 1993, pp.
[18] Domingos, P. and Hulten, G., "Mining high-speed data streams", In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, 2000, pp. 71-80.
