The Impacts of Data Stream Mining on Real-Time Business Intelligence

Yang Hang, Simon Fong
Faculty of Science and Technology, University of Macau, Macau SAR
henry.yh.gmail.com

Abstract

Real-time Business Intelligence (rt-BI) is an emerging field for business executives who need to make effective decisions in a very short time. Such immediate decisions are not necessarily based on historical data; instead, they are derived from the most recent data, usually obtained just minutes or seconds ago. A number of recent technologies are promising for rt-BI, such as real-time Data Warehouse, Complex Event Processing, real-time ETL, data stream base management systems, Stream Query Processing, and several rt-BI architectures available from both academic research and commercial products. One core component in the data analytics layer of a typical rt-BI architecture is the data mining algorithm. Although stream data mining has been studied extensively at the algorithmic level during the last decade, it has not been evaluated in relation to rt-BI. In this paper we conduct simulation experiments comparing a traditional data mining algorithm with a data stream mining algorithm with respect to their performance and applicability in rt-BI. Both synthetic and live data of up to 10^6 instances are used in the tests. The results should be a useful reference for information technologists who want to implement rt-BI applications with an appropriate choice of mining algorithm.

Keywords: data stream mining, real-time business intelligence, performance evaluation, JAVA, WEKA, MOA

I. INTRODUCTION

For many years, organizations have used business intelligence to gain insight into their business operations and then improve on them. Business intelligence (BI) was formulated into strategic and tactical business plans and initiatives, usually by analyzing historical business data. A new breed of BI, namely real-time BI (rt-BI), is emerging for managing, monitoring and optimizing daily business operations in real time or near real time. Rt-BI has been claimed to be the next generation of BI, as it is empowered by advanced predictive analytics over continuous data streams, real-time monitoring, and the speed of in-memory technology [1].

Business executives may devise strategic decisions and plans only annually or quarterly, reasoning from reports that are statistically generated from historical data. Traditional BI has been fulfilling this need. However, every now and then executives are faced with dozens of small tactical decisions, and often they have to decide immediately, on the spot, under a tight deadline. For instance, front-line operators and managers increasingly need to know what is happening in a dynamic market or business right now, as in this second, not yesterday or even half an hour ago, in order to make an instant decision. Typical examples range from deciding whether an arriving transaction among many is fraudulent, or whether a further inch of price could be bargained in a negotiation, to determining the best deal to offer an online customer so that the minimum profit margin is sustained given the current stock level and market demand, to name just a few. Rt-BI is designed to support such time-critical decisions.

Figure 1. The value of data to two major categories of decision making [2].

Based on [2, 3, 6], which argue that the value of data to decision making declines gradually as time goes by, Figure 1 shows a curve that extends across several types of BI as the latency increases. Essentially, the types of BI under the curve map to two major categories of decision making: time-critical decisions based on fresh data, and traditional business intelligence that relies on stored historical data. The main differences between these two categories include how they are used (as statistical reports or as actionable information) and the timeliness of the data from which they are generated. Rt-BI usually refers to time-critical decisions. The information from rt-BI is made available at ultra-low latency, from the very latest data.

II. DATA STREAM MINING

An important part of the rt-BI architecture, as in Figure 2, is the automated decision-making component, which is usually powered by a decision tree. The decision tree (DT) is one of the most important techniques for classification and prediction in data mining. Its advantage is that tree models have a higher degree of interpretability; rules can be readily extracted from a DT and built into an automated decision maker. In this paper we use the term traditional decision tree (TDT) for DT models built with classical algorithms such as ID3 [12], C4.5 [13], and CART [14]. It was argued in [15] that TDT models are not suitable for data streams that have highly fluctuating data rates in real-time applications. Therefore a new breed of data mining algorithms, falling under the domain of stream mining, has been proposed in recent years to tackle the requirements of real-time and fast data streaming.

TDT models assume that a learning model is built upon static and well-structured data. In contrast, stream mining, in particular the Hoeffding Tree Algorithm (HTA), constructs a decision tree dynamically along the moving data stream. HTA is chosen to represent the DT in data stream mining because of its popularity. C4.5 is a well-known TDT algorithm that generates a decision tree using information gain [13]. The decision trees generated by C4.5 can be used for classification, which is often embedded in decision support systems. The TDT classifier runs in two separate steps, train then test. Consequently, the rules in the DT model are refreshed only when the DT is rebuilt from the whole updated dataset. In contrast, HTA progressively updates the rules in real time, as the DT adjusts itself when fresh data stream in. Readers who want more details about the operation of HTA can refer to [16].
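The difference between rebuilding from the whole dataset and updating on each arrival can be illustrated with the per-attribute class counts that tree learners keep at a node. The following is a minimal Python sketch under stated assumptions, not the WEKA or MOA implementation; the names `batch_counts` and `IncrementalCounts` are hypothetical and serve only as illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence, Tuple

# One instance: (attribute values, class label).
Instance = Tuple[Sequence[Hashable], Hashable]


def batch_counts(dataset: Iterable[Instance]) -> Counter:
    """Train-then-test style: rescan every stored instance to rebuild the counts."""
    counts: Counter = Counter()
    for attributes, label in dataset:
        for index, value in enumerate(attributes):
            counts[(index, value, label)] += 1
    return counts


class IncrementalCounts:
    """Stream style (hypothetical helper): touch each instance exactly once."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.seen = 0

    def update(self, attributes: Sequence[Hashable], label: Hashable) -> None:
        # Cost depends on the number of attributes, not on how much data came before.
        for index, value in enumerate(attributes):
            self.counts[(index, value, label)] += 1
        self.seen += 1


if __name__ == "__main__":
    stream = [((1, 0), "yes"), ((1, 1), "no"), ((0, 0), "yes"), ((1, 0), "yes")]
    learner = IncrementalCounts()
    stored = []
    for attributes, label in stream:
        learner.update(attributes, label)   # one cheap update per arrival
        stored.append((attributes, label))  # the batch route must keep all data
    # Both routes end with the same statistics; only the cost profile differs.
    print(learner.counts == batch_counts(stored))
```

Both routes arrive at the same counts. The batch route has to store and rescan all data at every refresh, whereas the incremental route does a fixed amount of work per arriving instance.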
As highlighted in [17], it may be sufficient to use just a small sample of the available data for choosing the split attribute at any given node of a decision tree. This statistical method is known as Hoeffding bounds or additive Chernoff bounds, which are used to solve the difficult problem of deciding exactly how many samples are necessary at each node [5, 7, 9, 10, 11, 18]. Researchers in the literature cited above attempted to devise new algorithms that can restrict the tree size within a small amount of available memory. HTAs embedded in an rt-BI process should meet the following real-time constraints:

• Stationary and non-stationary data input
• Limited memory space
• Very fast response
• High applicability to incomplete data

With respect to real-time requirements, data stream mining has characteristics that are preferable to those of traditional data mining [8]. The study in [15] is an excellent review that covers most of the features of data stream mining algorithms. Nevertheless, the following diagrams illustrate the step flows of the TDT and of the HTA in stream mining.

Figure 2. Mining step flows comparison

Figure 3. Example of DT induced with data streams

In terms of DT model construction time, TDT requires multiple scans of the whole database at intervals. As the database grows with continually arriving new data, the time to read through the database escalates proportionally, and the read time will eventually become prohibitive, especially in dynamic business environments where many sources of data stream in at high speed. The operation of a TDT such as the classical C4.5 is briefly depicted as follows (though not to scale):

Figure 4. Timeline of interleaving activities in C4.5 operation

At the beginning there is an overhead of model construction in C4.5 (scans over the whole dataset) plus the time for model validation. After that, the model is ready for use.
Over time, the accuracy declines as new data arrive, because the model built upon old data falls short of catching up with new trends in the data. The model then needs to be refreshed (updated) with the new data included. The usage periods and the refresh periods interleave. As time goes on and the total data volume grows, each refresh takes longer, such that $T_n > T_{n-1} > \dots > T_2 > T_1$. Therefore the total running time for C4.5 will be

$T_{total} = T_0 + T_1 + T_2 + T_3 + \dots$

where $T_0$ is the overhead for the initial model building, or

$T_{total} = T_0 + \sum_{i} (t_i + a_i)$

where $i$ counts the times the model needs refreshing and $a_i$ is the additional amount of time that the refresh takes. $a_i$ grows exponentially in this manner. Consequently, the refresh time grows longer in each successive step because the total data volume gets larger as new incoming data are added. Eventually C4.5 will become unusable when the data size hits a limit. Is stream mining then a remedy, given that it was designed to handle data streams and suits very low-latency rt-BI operations? How does its accuracy compare with that of traditional data mining methods such as C4.5? A series of experiments was conducted to answer these questions.

III. EXPERIMENTS

A simulation system was programmed in Java to demonstrate the differences between TDT and HTA. The representative algorithm for TDT is J48 (C4.5), whose source code is provided by WEKA. The implementation of the Hoeffding Tree algorithm is based on source code taken from Massive Online Analysis (MOA). Both WEKA and MOA are experimental packages developed by researchers at the University of Waikato, New Zealand, who are among the pioneers as well as authoritative providers of open-source data mining software. The experiment platform is a PC with a 2.99 GHz CPU and 1 GB RAM.

In MOA, the data arrive continually and the total streaming data have a large size. In this case, one important factor that influences the accuracy of HTA is data quality, which is controlled by the proportion of useful data to noise data. In experiments A, B and C, we use three processed LED data streams of up to one million instances per stream. The LED stream datasets, widely tested by other researchers, represent a classical problem of predicting the digit displayed on a 7-segment LED display where each attribute has a 1% chance of being inverted. The synthesized datasets carry 24 binary attributes; only 7 attributes are relevant, and the rest can be configured as noise in the experiments. The datasets are mixed with 0%, 1% and 2% of noise data respectively. In the last experiment, D, we define a term called "usefulness": the measure of the effective accuracy of a DT model in use on new incoming data while the current model becomes outdated, until the next refresh.

A. Tree size comparison

This experiment tests the decision tree sizes obtained by the C4.5 and HTA algorithms. The same LED datasets were used for both algorithms. Up to one million records were used because we want to observe whether the DT sizes ever grow beyond the memory limits. Various percentages of noise were injected into the data because noise is known to have adverse effects on the sizes of DTs. The experimental results show the C4.5 tree sizes (numbers of tree nodes) obtained from data of different noise levels. When the dataset is free of noise, the tree size is constant at 19 nodes regardless of the number of instances. It is well known that C4.5 is sensitive to noisy data. The tree sizes increase almost linearly in proportion to the number of noisy instances. A similar phenomenon is observed for HTA: tree size climbs as the amount of noisy data grows. One interesting observation, however, is that HTA shows an almost identical rate of increase in tree size for noise levels 1% and 2%. Overall, the ratio of additional tree node to number of 1 instances, in the case of 2% errors, is 1: / 1 5 for HTA (for every 1 5 instances, there will be an increase of nodes), and the ratio for C4.5 is 1: / 1 4 (for every 1 4 instances, there will be an increase of nodes). These rates of increase in tree size confirm that C4.5 is many times worse than HTA. At the point when the training data reach 7,, the tree size exceeds 14, nodes (in the case of 2% noise). For HTA, the tree size is still kept below 1 in the same situation.

B. Running time comparison

The computation time spent in a data mining process is one of the most important factors influencing real-time efficiency. It has a direct impact on the real-time constraints and latency requirements [4].
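The scaling concern can be made concrete with the refresh model from Section II. The sketch below assumes, purely for illustration, that every refresh rescans all instances seen so far and that cost is proportional to the number of instances scanned. The function names and the batch size are hypothetical, and measured C4.5 rebuild times may grow faster than this simple count suggests.

```python
def rebuild_scans(batch_size: int, refreshes: int) -> int:
    """Instances scanned when the model is rebuilt from scratch after every batch."""
    # Assumption: the i-th refresh rescans all i * batch_size instances seen so far.
    return sum(i * batch_size for i in range(1, refreshes + 1))


def incremental_scans(batch_size: int, refreshes: int) -> int:
    """Instances scanned by a one-pass learner over the same stream."""
    return batch_size * refreshes


if __name__ == "__main__":
    for refreshes in (1, 10, 100):
        full = rebuild_scans(1000, refreshes)
        once = incremental_scans(1000, refreshes)
        print(refreshes, full, once, full / once)
```

Under this assumption the rebuild cost grows quadratically with the number of refreshes, while the one-pass cost grows only linearly.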
Although it is obvious that the time taken for data mining depends on the total data size, we are concerned with how the time requirements scale as the number of data records rises. Figure 6 shows clearly that C4.5 consumes much more time than HTA as the data size grows. For example, if the timeout required by the rt-BI is arbitrarily set at 6 seconds (results must be returned before the timeout, or else the information is deemed worthless), then for the same set of data with 2% noise, C4.5 can process only about 5, instances (see Fig 5), while HTA can process as many as 35, (see Fig 6).

Figure 5. C45 and HTA Computational Time (y-axis: computation time in seconds; series: C4.5 and HTA at 0%, 1% and 2% noise)

Figure 6. HTA Computational Time of Large Data Size (series: 0%, 1% and 2% noise)

C. Algorithm accuracy comparison

In this experiment, we can observe clearly that the accuracy of C4.5 is better than that of HTA. C4.5 can achieve a perfect score when it builds its DT from the full, noise-free dataset. HTA reaches about 73% accuracy with its one-pass, test-and-train model-building mechanism. The lower accuracy of HTA is also attributed to the Hoeffding bound approximation.
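The Hoeffding bound mentioned above states that, with probability 1 - δ, the true mean of a variable with range R lies within ε = sqrt(R² ln(1/δ) / (2n)) of the mean observed over n samples. The following Python sketch shows how such a bound can drive a split decision. It is a hedged illustration and not MOA's code: the function names, the default δ of 1e-7, and the use of log2 of the number of classes as the range of information gain are assumptions made here.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n samples."""
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("n must be positive and delta must lie in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 n: int, num_classes: int, delta: float = 1e-7) -> bool:
    """Split when the observed gap between the two best attributes exceeds epsilon."""
    # Assumption: information gain in bits ranges over [0, log2(num_classes)].
    value_range = math.log2(num_classes)
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Ten classes, as in a digit-recognition task such as the LED data.
    for n in (100, 1_000, 10_000, 100_000):
        print(n, hoeffding_bound(math.log2(10), 1e-7, n))
    print(should_split(best_gain=0.30, second_gain=0.25, n=200, num_classes=10))
    print(should_split(best_gain=0.30, second_gain=0.25, n=200_000, num_classes=10))
```

Because ε shrinks only as 1/sqrt(n), a node commits to a split on the evidence seen so far, without revisiting earlier instances. This is one plausible reading of why a one-pass learner can trail a batch learner that sees the full dataset.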

Although C4.5 performs better than HTA on a relatively small dataset, what happens if the data size grows much bigger, especially in a stream mining scenario where the data streams can potentially amount to infinity? In a subsequent experiment where C4.5 and HTA were tested on huge numbers of data records, C4.5 failed with an out-of-memory exception, while HTA kept operating normally under the same OS and hardware configuration. This again confirms that C4.5 is not suited to mining very large datasets, whereas HTA is. With very large experimental data sizes, from 1, to 1,, instances, the results indicate that the accuracy of HTA is cumulative: it increases as more and more instances are fed into the calculation. C4.5, however, cannot be run on the same experiment platform, because the number of instances is too large to build a decision tree in the given memory.

D. Model usefulness comparison

In this set of experiments we attempt to illustrate the usefulness of C4.5 and HTA in the light of realistic operations for rt-BI. In a real-life environment, where predictive analytics is used as one of the components in a BI system, the sequence of operation usually starts with building a decision-making model (the DT) and then putting it to use on the incoming data (which is similar to testing for accuracy in our experiment). Decisions are made in real time by the model, and they are supposed to be good until some time later, when the model needs to be updated (rebuilt) with the new data included. This process was already explained in Figure 4, and this experiment sets out to verify the usefulness of the data mining algorithms under such a working sequence.
This differs from the previous experiments, in which the instances were input to the data mining programs all at once (at each testing point along the x-axis), and the division of the data for model training, cross-validation and testing was done automatically by the programs according to their default settings. Real data are also used in this experiment: a large dataset of financial transactions involving loans, credit cards, clients' demographic data, etc., collected from a Discovery Challenge hosted at the PKDD 99 conference. The data have more than 6, instances and over 5 attributes. In this experiment, C4.5 is set up so that the DT model is updated at recurring intervals, each time 15, instances have arrived. As a result, there are four periods in which a model update takes place. In each period, the first 3, data are collected for rule-building, while the other 12, data are used for prediction by the just-updated decision-making model. The simulated result for C4.5 is shown in Figure 7. Clearly, the rules established from aged data fall short in accuracy when making predictions on newly arriving data. This is reflected by similar declining trends over the four periods. For comparison, we applied the same dataset to HTA in another experiment.

Figure 7. The usefulness of C4.5 in real datasets (y-axis: accuracy, 0% to 100%; x-axis: Update Periods 1 to 4)

The result in Figure 8 shows that the performance curve is rather steady (in contrast to the broken, declining lines of C4.5) and that the general accuracy keeps improving as the DT is updated by the mining mechanism of HTA each time new data are fed in. However, even as the dataset size becomes very large, the accuracy of HTA seems to be bounded at a maximum of 80%. That once again confirms that the accuracy of HTA is lower than that of C4.5 by a margin, even in the long run.
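The periodic-refresh protocol used for C4.5 above (rebuild on the leading slice of each period, then predict the rest of the period with that model) can be sketched as follows. This is an illustrative Python outline, not the authors' Java simulator. The majority-class learner is a deliberately trivial stand-in for C4.5, all names are hypothetical, and the period sizes in the example are scaled down.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Instance = Tuple[Tuple[int, ...], Hashable]  # (attribute values, class label)


def fit_majority(train: Sequence[Instance]) -> Hashable:
    """Toy stand-in for a batch learner: remember the most frequent label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def predict_majority(model: Hashable, attributes: Tuple[int, ...]) -> Hashable:
    """Always predict the remembered label, ignoring the attributes."""
    return model


def periodic_refresh_accuracy(
    stream: Sequence[Instance],
    period: int,
    train_size: int,
    fit: Callable[[Sequence[Instance]], Hashable] = fit_majority,
    predict: Callable[[Hashable, Tuple[int, ...]], Hashable] = predict_majority,
) -> List[float]:
    """Rebuild the model at the start of each period and score it on the rest."""
    if not 0 < train_size < period:
        raise ValueError("train_size must be between 1 and period - 1")
    accuracies = []
    # Only complete periods are evaluated; a trailing partial period is ignored.
    for start in range(0, len(stream) - period + 1, period):
        block = stream[start:start + period]
        model = fit(block[:train_size])   # refresh with the newest data
        held_out = block[train_size:]     # model is used until the next refresh
        hits = sum(1 for attributes, label in held_out
                   if predict(model, attributes) == label)
        accuracies.append(hits / len(held_out))
    return accuracies


if __name__ == "__main__":
    # Four periods of 15 instances; in each, the label flips for the last 3,
    # so a model built on the first 3 ages before the next refresh.
    one_period = [((0,), "a")] * 12 + [((0,), "b")] * 3
    print(periodic_refresh_accuracy(one_period * 4, period=15, train_size=3))
```

Passing a real batch learner as `fit` and `predict` would turn this outline into the usefulness measurement described for C4.5. An incremental learner would instead be scored instance by instance as the stream arrives.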
Figure 8. The usefulness of HTA in real datasets (y-axis: accuracy, 0% to 100%; x-axis: Update Periods 1 to 4)

IV. CONCLUSION

Business intelligence can be classified into three main types: strategic, tactical, and operational [16]. The first two deal with managing (long-term) business plans and goals based on historical data, while the last focuses on managing and optimizing daily business operations. In operational BI, low-latency processing of business events as they happen is a critical need. To support this kind of real-time BI, the underlying analytic mechanism must handle data streams that potentially amount to infinity, and must be able to produce a decision very quickly and with reasonable accuracy. In this paper we built a Java simulator by importing and modifying two popular open-source packages, WEKA and MOA, to evaluate the two algorithms C4.5 and HTA, which represent traditional data mining and data stream mining respectively. Interesting properties were observed in the experiments, which are summarized as follows:

• HTA is able to achieve accuracy similar to C4.5's in a small fraction of the time;
• C4.5 can achieve a higher accuracy than HTA;
• The accuracy of HTA is cumulative and improves as more data arrive;
• C4.5's memory requirements and batch nature do not allow it to cope with data streams of large size;
• When the datasets are infested with noise, both algorithms suffer, but C4.5 soon runs into memory explosion, with a fast-growing tree, in the presence of noisy data.

Based on the above points, HTA, which represents data stream mining, is a more suitable algorithm than C4.5 for rt-BI: it meets the requirements of rt-BI, such as minimal use of memory space, fast processing time, one pass over a very large amount of streaming data, and reasonable accuracy. This paper contributes an empirical study that substantiates the suitability of data stream mining (instead of traditional data mining) for rt-BI.

ACKNOWLEDGMENT

The authors are grateful that this research project, titled Real-time Data Stream Mining, is supported by the Research Committee, University of Macau. Grant number: RG7/9-1S/FCC/FST.

REFERENCES

[1] Doug Henschen, "Next-Gen BI is Here", InformationWeekanalytics.com, White Report, Sept. 18, 2009.
[2] Michael J. Franklin, "Continuous analytics: data stream query processing in practice", Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems, DEBS 2010, Cambridge, United Kingdom, July 12-15, 2010, pp. 1.
[3] Judith R. Davis, "Right-Time Business Intelligence: Optimizing the Business Decision Cycle", B-EYE-Network.com, White Report, Jan. 2006.
[4] Yang Hang, Simon Fong, "Evaluating Hoeffding Tree Algorithm in Real-time Web Applications Environment", The 2nd International Conference on IT and Business Intelligence (ITBI-10), November 2010, Nagpur, India, accepted for publication.
[5] Nishimura, S., Terabe, M., Hashimoto, K., and Mihara, K., "Learning Higher Accuracy Decision Trees from Concept Drifting Data Streams", In Proceedings of the 21st International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Springer-Verlag, Heidelberg, 2008.
[6] Zeljko Panian, "Just-in-Time Business Intelligence and Real-Time Decisioning", Proceedings of the 9th WSEAS International Conference on Applied Informatics and Communications, Moscow, Russia, 2009.
[7] Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby, "New Options for Hoeffding Trees", Advances in Artificial Intelligence, Springer, 2007.
[8] Yang Hang, Simon Fong, "Real-time Business Intelligence System Architecture with Stream Mining", The 5th International Conference on Digital Information Management (ICDIM 2010), July 2010, Thunder Bay, Canada, accepted for publication.
[9] Tao Wang, Zhoujun Li, Xiaohua Hu, Yuejin Yan, and Huowang Chen, "A New Decision Tree Classification Method for Mining High-Speed Data Streams Based on Threaded Binary Search Trees", Emerging Technologies in Knowledge Discovery and Data Mining, Springer, 2009.
[10] Gama, J., Medas, P., and Rodrigues, P., "Learning decision trees from dynamic data streams", In Proceedings of the 2005 ACM Symposium on Applied Computing, ACM, New York, 2005.
[11] Hulten, G., Spencer, L., and Domingos, P., "Mining time-changing data streams", In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, 2001.
[12] Quinlan, J.R., "Induction of decision trees", Machine Learning, 1, 1986.
[13] Quinlan, J.R., "C4.5: Programs for machine learning", Morgan Kaufmann series in machine learning, Kluwer Academic Publishers, 1993.
[14] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., "Classification and regression trees", California, USA, Wadsworth, 1984.
[15] Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S., "Mining data streams: a review", SIGMOD Rec. 34, 2, Jun. 2005.
[16] Albert Bifet, Geoff Holmes, Richard Kirkby, Bernhard Pfahringer, "MOA: Massive Online Analysis", Journal of Machine Learning Research, MIT, Volume 11, May 2010.
[17] Maron, O., and Moore, A.W., "Hoeffding races: Accelerating Model Selection Search for Classification and Function Approximation", NIPS, 1993.
[18] Domingos, P. and Hulten, G., "Mining high-speed data streams", In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, 2000, pp. 71-80.

Optimized Very Fast Decision Tree with Balanced Classification Accuracy and Compact Tree Size

Optimized Very Fast Decision Tree with Balanced Classification Accuracy and Compact Tree Size Optimized Very Fast Decision Tree with Balanced Classification Accuracy and Compact Tree Size Hang Yang, Simon Fong Faculty of Science and Technology, University of Macau Av. Padre Tomás Pereira Taipa,

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique www.ijcsi.org 29 Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Jesse Read 1, Albert Bifet 2, Bernhard Pfahringer 2, Geoff Holmes 2 1 Department of Signal Theory and Communications Universidad

More information

Accurate Ensembles for Data Streams: Combining Restricted Hoeffding Trees using Stacking

Accurate Ensembles for Data Streams: Combining Restricted Hoeffding Trees using Stacking JMLR: Workshop and Conference Proceedings 1: xxx-xxx ACML2010 Accurate Ensembles for Data Streams: Combining Restricted Hoeffding Trees using Stacking Albert Bifet Eibe Frank Geoffrey Holmes Bernhard Pfahringer

More information

High-Speed Data Stream Mining using VFDT

High-Speed Data Stream Mining using VFDT High-Speed Data Stream Mining using VFDT Ch.S.K.V.R.Naidu, Department of CSE, Regency Institute of Technology, Yanam, India naidu.ch@gmail.com S. Devanam Priya Department of CSE, Regency Institute of Technology,

More information

Efficient integration of data mining techniques in DBMSs

Efficient integration of data mining techniques in DBMSs Efficient integration of data mining techniques in DBMSs Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex, FRANCE {bentayeb jdarmont

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn University,

More information

Review on Gaussian Estimation Based Decision Trees for Data Streams Mining Miss. Poonam M Jagdale 1, Asst. Prof. Devendra P Gadekar 2

Review on Gaussian Estimation Based Decision Trees for Data Streams Mining Miss. Poonam M Jagdale 1, Asst. Prof. Devendra P Gadekar 2 Review on Gaussian Estimation Based Decision Trees for Data Streams Mining Miss. Poonam M Jagdale 1, Asst. Prof. Devendra P Gadekar 2 1,2 Pune University, Pune Abstract In recent year, mining data streams

More information

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER P.Radhabai Mrs.M.Priya Packialatha Dr.G.Geetha PG Student Assistant Professor Professor Dept of Computer Science and Engg Dept

More information

New ensemble methods for evolving data streams

New ensemble methods for evolving data streams New ensemble methods for evolving data streams A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà Laboratory for Relational Algorithmics, Complexity and Learning LARCA UPC-Barcelona Tech, Catalonia

More information

Lecture 7. Data Stream Mining. Building decision trees

Lecture 7. Data Stream Mining. Building decision trees 1 / 26 Lecture 7. Data Stream Mining. Building decision trees Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 26 1 Data Stream Mining 2 Decision Tree Learning Data Stream Mining 3

More information

REGRESSION BY SELECTING APPROPRIATE FEATURE(S)

REGRESSION BY SELECTING APPROPRIATE FEATURE(S) REGRESSION BY SELECTING APPROPRIATE FEATURE(S) 7ROJD$\GÕQDQG+$OWD\*üvenir Department of Computer Engineering Bilkent University Ankara, 06533, TURKEY Abstract. This paper describes two machine learning

More information

Temporal Weighted Association Rule Mining for Classification

Temporal Weighted Association Rule Mining for Classification Temporal Weighted Association Rule Mining for Classification Purushottam Sharma and Kanak Saxena Abstract There are so many important techniques towards finding the association rules. But, when we consider

More information

Distribution Based Data Filtering for Financial Time Series Forecasting

Distribution Based Data Filtering for Financial Time Series Forecasting Distribution Based Data Filtering for Financial Time Series Forecasting Goce Ristanoski 1, James Bailey 1 1 The University of Melbourne, Melbourne, Australia g.ristanoski@pgrad.unimelb.edu.au, baileyj@unimelb.edu.au

More information

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets for data streams over Weighted Sliding Windows Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology

More information

Fuzzy Partitioning with FID3.1

Fuzzy Partitioning with FID3.1 Fuzzy Partitioning with FID3.1 Cezary Z. Janikow Dept. of Mathematics and Computer Science University of Missouri St. Louis St. Louis, Missouri 63121 janikow@umsl.edu Maciej Fajfer Institute of Computing

More information
