Massive data mining using Bayesian approach


Prof. Dr. P K Srimani, Former Director, R&D, Bangalore University, Bangalore, India. profsrimanipk@gmail.com
Mrs. Malini M Patil, Assistant Professor, Dept. of ISE, JSS Academy of Technical Education, Bangalore, India. Patilmalini31@gmail.com

Abstract

The advancement of technology has led to a large flow of data in digital form. In data mining, such data is typically processed as a large but static data set. Data sets that grow continuously and rapidly over time are referred to as data streams, or dynamic data streams. Examples include network monitoring data, sensor data and click streams in search engines. The traditional data mining approach does not address the problem of dynamic data streams, which can be mined only with sophisticated techniques. In a data stream model the data arrive at very high speed, and the algorithm must process the stream under very strict constraints of space and time. Massive Online Analysis (MOA) is a framework used to mine data streams. The paper aims at understanding the performance of the Naive Bayes algorithm on the data stream generators available in the MOA framework.

Keywords- Data streams, Naive Bayes, Massive Online Analysis, Data set generators, Massive Data Mining.

I. INTRODUCTION

Today's era of technology has brought a massive increase in data generation, which has become an automated process. This is mainly because of mobile applications, sensor applications, measurements in network traffic monitoring and management, log records and click streams in search engines, web logs, e-mails, blogs, Twitter posts, etc. Data generated in this way can be considered streaming data, since it is obtained over an interval of time. Thus a data stream is defined as an ordered sequence of items that arrive in timely order [1]. Data streams are different from data in traditional databases.
They are continuous, unbounded, usually arrive at high speed and have a data distribution that often changes with time [2]. In traditional data mining the databases can be very large (GB, TB or more). Within these large databases lies hidden information of strategic importance, which is discovered through data mining (DM). DM is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The computational techniques are responsible for finding patterns that were previously unknown and are useful for future analysis. DM is an integral part of Knowledge Discovery in Databases (KDD), which is the overall process of converting raw data into useful and structured information. The KDD process comprises six phases, viz., data selection, data cleaning, data enrichment, data transformation or encoding, data mining, and reporting and display of the discovered information. Many organizations worldwide already use DM techniques to explore the hidden useful information in their databases. DM draws on ideas such as sampling, estimation and hypothesis testing from statistics; search algorithms, modeling techniques and learning theories from artificial intelligence, pattern recognition and machine learning; and high-performance computing. Thus, data mining is a confluence of many disciplines. The advancement of technology has resulted in the evolution of different techniques in the area of DM, such as association rule mining, classification and clustering, and new research findings have raised new issues in each technique.

The paper is organized as follows: Section II discusses the need and importance of the problem; Section III reviews related work; Section IV presents preliminaries; Section V describes the methodology; experiments and results are presented in Section VI; finally, Section VII gives conclusions and future work.

II. NEED AND IMPORTANCE OF THE PROBLEM

The following important challenges explain the need for and importance of mining data streams.

Speed: High speed is one of the inherent characteristics of data streams. The algorithm developed must be capable of handling this speed: the rate of building the data stream model must be faster than the data rate.

Memory: Classification techniques need the data to reside in memory while the model is built, but the huge volume of data generated by a stream would need unbounded memory.

Concept drift: In the real world, concepts are often not stable but change with time. Weather forecasting data is a good example, where the data distribution also changes over time. Often these changes make the model built on old data inconsistent with the new data, so regular updating of the model is necessary. This problem is known as concept drift. It complicates the task of learning a model from data and requires special, sophisticated approaches.

Trade-off between accuracy and efficiency: The main trade-off in data stream mining algorithms is between the accuracy of the output with regard to the application and the time and space complexities. Sophisticated methods are essential to handle such trade-offs.

Visualization of data stream mining results: Visualization of traditional data mining results on a desktop has been a research issue for more than a decade, and visualization of data stream mining results is equally challenging.

Modeling change of mining results over time: In some cases the user is not interested in the data stream mining results themselves, but in how these results change over time. The classification changes can help in understanding the change in the data streams.

Interactive mining environment: Mining data streams is a highly application-oriented field. For example, the environment must allow the user to change the classification parameters with respect to the current context. The fast nature of data streams often makes it more difficult to incorporate user interaction.

Integration of data stream management systems and data stream mining approaches: The integration of storage, querying, mining and reasoning over the incoming stream would realize robust streaming systems that could be used in different applications.

Technology: Issues related to technology are among the important challenges in mining data streams, viz., how to handle such large data with respect to storage; how to represent the data in such an environment in a compressed way; which platforms are best suited; what type of hardware is suitable; and how to handle the complex computations.

III. RELATED WORK

The techniques of data mining are exhaustively presented in [3, 4, 5, 6]. A typical data set is taken from the Technical Education System (TES) of one organization. The knowledge discovered helps TES to take useful decisions for maintaining the quality of the education system.
The results of the exhaustive research work [3, 4, 5, 6] are highly effective in taking optimal decisions at the managerial level. A model [7] was proposed for the first time using a medical data stream from a Super Specialty Health Care Unit (SSHCU). The method uses the landmark window model and the K-means clustering technique to generate the clusters. Massive Data Mining, a technique used to mine data streams, was proposed by the authors in [8], where classification is used to mine the data streams. In [9] the authors compared traditional DM techniques with data stream mining techniques. In [10, 11, 12] the authors presented a framework for stream classification and clustering, referred to as the Massive Online Analysis framework. Therefore, the present work is carried out to throw light on the qualitative as well as the quantitative aspects of the problem.

IV. PRELIMINARIES

Classification of large data sets is a supervised learning technique from statistics and machine learning that extracts rules and patterns from data for use in prediction. In [18] the author explained the different techniques of classification, prediction and regression along with state-of-the-art algorithms. Classification techniques include decision tree induction, rule-based classifiers, nearest neighbor classifiers, statistical methods and neural network approaches. The objective in classification is to build a mapping function that assigns class labels to each new instance, or to verify the appropriateness of class labels already assigned. Mathematically, classification is defined as follows: given a database $X = \{x_1, x_2, x_3, \ldots, x_n\}$ of tuples (items and records) and a set of classes $Y = \{y_1, y_2, y_3, \ldots, y_m\}$, classification is the task of learning a target function $f : x \rightarrow y$ that maps each attribute set $x$ to one of the predefined class labels $y$.
Thus classification is the task of mapping an input attribute set X to its class label Y. The general approach for solving classification problems is shown in Fig. 1. First, a training set consisting of records whose class labels are known must be provided. The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.

Figure 1. General approach for solving classification problems.

V. METHODOLOGY FOR MASSIVE DATA MINING

The literature indicates that it is impractical to scan through an entire data stream more than once [18]. The huge size of such data sets also implies that it is generally not possible to store the entire stream in main memory or even on disk. For effective processing of stream data, new data structures, techniques and algorithms are needed. Because an infinite amount of space to store stream data is not available, there is often a trade-off between accuracy and storage. From the algorithmic point of view, the algorithms must be efficient in both space and time. Instead of storing all or most elements seen so far, using $O(N)$ space, it is preferable to use polylogarithmic space, $O(\log^k N)$, where $N$ is the number of elements in the stream data. Massive Online Analysis (MOA) is a software framework for implementing algorithms and running experiments for online learning from evolving data streams. MOA [10, 11, 12] is designed to deal with the


challenging problems of scaling up the implementation of the algorithms to real-world data set sizes and making algorithms comparable in benchmark streaming settings. The steps used in the methodology are presented below.

A. Massive data mining (MDM)

Massive data mining is performed using the Massive Online Analysis (MOA) framework [13, 14, 15]. It is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed so that it can handle the challenging problem of scaling up the implementation of state-of-the-art algorithms to real-world data sets. It consists of offline and online algorithms for classification and clustering, together with tools for evaluation. Thus MOA is an open source framework for handling massive, potentially infinite, evolving data streams. MOA mainly permits the evaluation of data stream learning algorithms on large streams under explicit memory limits. The MDM method consists of the following steps:

i) Select the task.
ii) Select the learner.
iii) Select the stream generator.
iv) Select the evaluator.

The model is configured with these four steps and the results are recorded. An initial configuration model is shown in Fig. 2.

Fig. 2. General Configuration Model for classification in MOA

B. Evaluation process in MOA

There are two options for the evaluation process in MOA, viz., holdout and prequential. The first is suitable when the division between train and test sets is predefined, so that the results from different studies can be directly compared. In the second, each individual example can be used to test the model before it is used for training, and the accuracy can be incrementally updated accordingly.

C. Performance Evaluators in MOA

MOA provides four performance evaluators for measuring the performance of an algorithm, namely the Window Classification Performance Evaluator (WCPE), the Basic Classification Performance Evaluator (BCPE), the EWMA Classification Performance Evaluator and the Fading Factor Classification Performance Evaluator. The present work uses only WCPE.

D. Algorithm used in the analysis: Naïve Bayes (NB)

The algorithm performs classic Bayesian prediction while making the naive assumption that all inputs are independent [17]. Naïve Bayes is a classifier known for its simplicity and low computational cost. Given $n_c$ different classes, the trained Naïve Bayes classifier predicts, for every unlabelled instance $I$, the class $C$ to which it belongs with high accuracy. The model works as follows: let $x_1, \ldots, x_k$ be $k$ discrete attributes, and assume that $x_i$ can take $n_i$ different values. Let $C$ be the class attribute, which can take $n_c$ different values. Upon receiving an unlabelled instance $I = (x_1 = v_1, \ldots, x_k = v_k)$, the Naïve Bayes classifier computes a probability of $I$ being in class $c$ as:

$$\Pr[C = c \mid I] \propto \Pr[C = c] \cdot \prod_{i=1}^{k} \frac{\Pr[x_i = v_i \wedge C = c]}{\Pr[C = c]} \qquad (1)$$

The values $\Pr[x_i = v_j \wedge C = c]$ and $\Pr[C = c]$ are estimated from the training data. Thus, the summary of the training data is simply a 3-dimensional table that stores for each triple $(x_i, v_j, c)$ a count $N_{i,j,c}$ of training instances with $x_i = v_j$, together with a 1-dimensional table for the counts of $C = c$. This algorithm is naturally incremental: upon receiving a new example (or a batch of new examples), simply increment the relevant counts. Predictions can be made at any time from the current counts.

E. Data streams used in the analysis

The literature survey shows that real data stream sources are scarce. The present work is therefore carried out on the data stream generators available in the MOA framework. The eight main data stream generators used in the present investigation are explained below.

RANDOMTREE-Generator: Generates a stream based on a randomly generated tree, contributed by [16].
It constructs a decision tree by choosing attributes at random to split, and assigning a random class label to each leaf. Once the tree is built, new examples are

generated by assigning uniformly distributed random values to attributes, which then determine the class label via the tree. The generator has parameters to control the number of classes, attributes, nominal attribute labels and the depth of the tree. A degree of noise can be introduced to the examples after generation. For discrete attributes and the class label, a noise probability parameter determines the chance that any particular value is switched to something other than the original value. For numeric attributes, random noise is added to all values, drawn from a Gaussian distribution with standard deviation equal to the standard deviation of the original values multiplied by the noise probability.

RANDOMRBF-Generator: Generates a random radial basis function stream, found in [15]. This generator was devised to offer an alternative complex concept type that is not straightforward to approximate with a decision tree model. The RBF (Radial Basis Function) generator works as follows: a fixed number of random centroids are generated. Each centroid has a random position, a single standard deviation, a class label and a weight. New examples are generated by selecting a centroid at random, taking weights into consideration so that centroids with higher weight are more likely to be chosen. A random direction is chosen to offset the attribute values from the central point. The length of the displacement is randomly drawn from a Gaussian distribution with standard deviation determined by the chosen centroid. The chosen centroid also determines the class label of the example. This effectively creates a normally distributed hypersphere of examples surrounding each central point, with varying densities. Only numeric attributes are generated.

SEA-Generator: Generates SEA concept functions, named after the streaming ensemble algorithm (SEA) for large-scale classification. This dataset contains abrupt concept drift, first introduced by [18].
It is generated using three attributes, of which only the first two are relevant. All three attributes have values between 0 and 10. The points of the dataset are divided into 4 blocks with different concepts. In each block, the classification is done using $f_1 + f_2 \le \delta$, where $f_1$ and $f_2$ represent the first two attributes and $\delta$ is a threshold value. The most frequent values of $\delta$ are 9, 8, 7 and 9.5 for the four data blocks.

STAGGER-Generator: Generates STAGGER concept functions. These concepts come from work on incremental learning from noisy data and were introduced by [19]. The STAGGER concepts are boolean functions of three attributes encoding objects: size (small, medium and large), shape (circle, triangle and rectangle) and colour (red, blue and green). A concept description covering either green rectangles or red triangles is represented by (shape = rectangle and colour = green) or (shape = triangle and colour = red).

WAVEFORM-Generator: Generates a problem of predicting one of three waveform types. It shares its origins with LED, and was also donated by [20] to the UCI repository. The goal of the task is to differentiate between three different classes of waveform, each of which is generated from a combination of two or three base waves. The optimal Bayes classification rate is known to be 86%. There are two versions of the problem: wave21, which has 21 numeric attributes, all of which include noise, and wave40, which introduces an additional 19 irrelevant attributes.

AGARWAL-Generator: Generates one of ten different pre-defined loan functions. It was a common source of data for early work on scaling up decision tree learners, contributed by [22]. The generator produces a stream containing nine attributes, six numeric and three categorical. These attributes describe hypothetical loan applications. There are ten functions defined for generating binary class labels from the attributes. Presumably these determine whether the loan should be approved.
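The SEA and STAGGER concepts described above are simple enough to write down directly. The following is a minimal Python sketch under stated assumptions, not the MOA implementation: the names `SEA_THRESHOLDS`, `sea_label`, `sea_stream` and `stagger_label` are hypothetical, the block length is a free parameter chosen by the caller, and noise is omitted.

```python
import random

# Thresholds for the four SEA blocks, taken from the text (9, 8, 7 and 9.5).
SEA_THRESHOLDS = (9.0, 8.0, 7.0, 9.5)


def sea_label(f1, f2, block):
    """Return 1 when f1 + f2 <= threshold of the given block (0..3), else 0."""
    return int(f1 + f2 <= SEA_THRESHOLDS[block])


def sea_stream(n, block_size, seed=1):
    """Yield n (attributes, label) pairs; the concept changes abruptly
    every block_size examples (block_size is an illustrative parameter)."""
    rng = random.Random(seed)
    for t in range(n):
        block = min(t // block_size, len(SEA_THRESHOLDS) - 1)
        # Three attributes in [0, 10]; the third one is irrelevant to the label.
        x = [rng.uniform(0, 10) for _ in range(3)]
        yield x, sea_label(x[0], x[1], block)


def stagger_label(size, shape, colour):
    """The example concept from the text: green rectangles or red triangles.
    The size attribute is accepted but does not affect this concept."""
    return int((shape == "rectangle" and colour == "green")
               or (shape == "triangle" and colour == "red"))
```

Because the threshold changes from one block to the next, a model trained on one block of `sea_stream` becomes partly inconsistent with the following block, which is the abrupt concept drift mentioned in the text.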
HYPERPLANE-Generator: Generates a problem of predicting the class of a rotating hyperplane. It was introduced by [21]. A hyperplane in d-dimensional space is the set of points $x$ that satisfy

$$\sum_{i=1}^{d} w_i x_i = w_0 \qquad (2)$$

where $x_i$ is the $i$th coordinate of $x$. Hyperplanes are useful for simulating time-changing concepts, because the orientation and position of the hyperplane can be changed smoothly by changing the relative size of the weights. Change is introduced to this dataset by adding drift to each weight attribute, $w_i = w_i + d\sigma$, where $\sigma$ is the probability that the direction of change is reversed and $d$ is the change applied to every example.

LED-Generator: Generates a problem of predicting the digit displayed on a 7-segment LED display. This data source originates from the CART book. An implementation in C was donated to the UCI machine learning repository by David Aha. The main idea is contributed by [23]. The goal is to predict the digit displayed on a seven-segment LED display, where each attribute has a 10% chance of being inverted. It has an optimal Bayes classification rate of 74%. The particular configuration of the generator used for experiments produces 24 binary attributes, 17 of which are irrelevant.

F. Classification Model in MOA

The classification model in MOA is based on the four basic requirements of MDM. A data stream environment has different requirements from the traditional batch learning setting. The model is shown in Fig. 3. The main requirements in mining data streams are summarized as follows:

Process one example at a time, and inspect it only once.


Use a limited amount of memory.
Work in a limited amount of time.
Be ready to make a prediction at any time.

Figure 3. A classification model in MOA

VI. EXPERIMENTS AND RESULTS

The results of the present investigation are presented in Table I. Figures 4, 5 and 6 present the results graphically and are self-explanatory. Other observations are presented below. The performance of Naive Bayes is analysed using the Massive Online Analysis framework on all 8 data stream generators available in MOA: LED, AGARWAL, HYPERPLANE, SEA, STAGGER, RANDOMRBF, RANDOMTREE and WAVEFORM. The experiment uses 100,000,000 data instances. The classifier used is Naive Bayes. MOA offers two evaluation procedures, viz., prequential and holdout; the present work uses only the prequential evaluation method. The performance evaluator used is the Window Classification Performance Evaluator (WCPE). From the result table it is found that the performance of the Naive Bayes algorithm is excellent for the STAGGER generator, with accuracy = 100%, Kappa = 100%, Time = sec. In keeping with the requirements of mining data streams, the Naive Bayes algorithm consumes almost no RAM-hours, and the memory used is also negligible.

Table I. Results of MDM using Naive Bayes algorithm

Figure 4. Graph of Accuracy and Data Streams for NB

Figure 5. Graph of Time and Data Streams for NB
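The combination used in these experiments (an incremental Naive Bayes learner, prequential test-then-train evaluation, and accuracy and Kappa measured over a window) can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not MOA's code: the names `IncrementalNaiveBayes` and `prequential` are hypothetical, attributes are assumed discrete, Laplace smoothing is an added assumption, Kappa is computed as Cohen's kappa, and only the final window is reported.

```python
from collections import defaultdict, deque


class IncrementalNaiveBayes:
    """Naive Bayes over discrete attributes, stored as counts N[i, v, c] and N[c]."""

    def __init__(self):
        self.class_counts = defaultdict(int)  # N[c]
        self.attr_counts = defaultdict(int)   # N[(i, v, c)]
        self.values = defaultdict(set)        # distinct values seen per attribute

    def learn(self, x, c):
        """Incremental update: increment the relevant counts only."""
        self.class_counts[c] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, v, c)] += 1
            self.values[i].add(v)

    def predict(self, x):
        """Return the most probable class, or None before any training."""
        total = sum(self.class_counts.values())
        best, best_score = None, -1.0
        for c, nc in self.class_counts.items():
            score = nc / total  # estimate of Pr[C = c]
            for i, v in enumerate(x):
                # Laplace smoothing: an assumption, not stated in the text.
                n_vals = len(self.values.get(i, set()) | {v})
                score *= (self.attr_counts.get((i, v, c), 0) + 1) / (nc + n_vals)
            if score > best_score:
                best, best_score = c, score
        return best


def prequential(stream, learner, window=1000):
    """Test-then-train: predict each example first, then learn from it.
    Returns (accuracy, kappa) over the last `window` examples."""
    recent = deque(maxlen=window)  # (predicted, actual) pairs
    for x, y in stream:
        recent.append((learner.predict(x), y))
        learner.learn(x, y)
    n = len(recent)
    if n == 0:
        return 0.0, 0.0
    p0 = sum(p == y for p, y in recent) / n  # observed agreement
    labels = {y for _, y in recent} | {p for p, _ in recent}
    # Chance agreement between the predicted and the actual label distributions.
    pc = sum((sum(p == l for p, _ in recent) / n)
             * (sum(y == l for _, y in recent) / n) for l in labels)
    # Kappa is undefined when pc == 1; it is reported as 0.0 in that case.
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else 0.0
    return p0, kappa
```

Each example is seen exactly once and the learner keeps only its count tables, so memory depends on the number of attribute values and classes rather than on the length of the stream.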


Figure 6. Graph of Kappa vs Data Streams for NB

VII. CONCLUSION

The present work focuses on the performance analysis of the Naive Bayes algorithm on eight different data set generators available in MOA. The number of instances is 100,000,000 for every data stream generator. The final results are presented for ready reference. The learning model is evaluated using the prequential evaluation method and the Window Classification Performance Evaluator. Naive Bayes performs best for the STAGGER generator, with accuracy = 100% and Kappa = 100%. The results of the present study provide a strong platform for enhancing the accuracy of the method effectively. Further, it is concluded that the MDM technique is well suited to massive data and has a lot of scope for future research.

REFERENCES

[1] Aggarwal, C.C. (Ed.), "Data Streams: Models and Algorithms," Series: Advances in Database Systems, Vol. 31, XVIII, 354 p., 2007, ebook, Springer, Berlin Heidelberg.
[2] Guha, S., Koudas, N.K. and Shim, K., "Data Streams and Histograms," Proceedings of the Thirty-third Annual ACM Symposium on Theory of Computing, 2003, ACM Press.
[3] Srimani, P.K. and Patil, M.M., "Edu-Mining: A Machine Learning Approach," AIP Conference Proceedings 1414, 2011, Jaipur, India.
[4] Srimani, P.K. and Patil, M.M., "A Classification Model for Edu-Mining," Proceedings of International Conference on Intelligent Computational Systems, pp. 35-40, 2012, Dubai, UAE.
[5] Srimani, P.K. and Patil, M.M., "A Comparative Study of Classifiers for Student Module in Technical Education System (TES)," International Journal of Current Research, Vol. 4, Issue 01.
[6] Srimani, P.K. and Patil, M.M., "Performance Evaluation of Classifiers for Edu-data: An Integrated Approach," International Journal of Current Research, Vol. 4, Issue 02.
[7] Srimani, P.K. and Patil, M.M., "Data Stream Mining Using Landmark Stream Model for Offline Data Streams: A Case Study of Health Care Unit," Proceedings of the 4th National Conference, INDIACom-2010 Computing For Nation Development, February 25-26, 2010.
[8] Srimani, P.K. and Patil, M.M., "Massive Data Mining on Data Streams Using Classification Algorithms," International Journal of Engineering Science and Technology, Vol. 4, Issue 06, 2012.
[9] Srimani, P.K. and Patil, M.M., "Knowledge Discovery in Data Mining and Massive Data Mining," International Journal of Emerging Technologies in Computational and Applied Sciences 5, Vol. 1, 2, & 3, June-August, 2013.
[10] Bifet, A., Kirkby, R., Kranen, P. and Reutemann, P., "Massive Online Analysis," Technical Manual, University of Waikato, Hamilton, New Zealand, 2013.
[11] Bifet, A. and Kirkby, R., "Data Stream Mining: A Practical Approach," Technical report, The University of Waikato, Hamilton, New Zealand.
[12] Bifet, A., Frank, E., Holmes, G. and Pfahringer, B., "MOA: Massive Online Analysis," Journal of Machine Learning Research.
[13] Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R. and Gavaldà, R., "New ensemble methods for evolving data streams," Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, ACM.
[14] Bifet, A. and Gavaldà, R., "Adaptive learning from evolving data streams," Advances in Intelligent Data Analysis VIII, 2009, Springer, Berlin Heidelberg.
[15] Bifet, A., Frank, E., Holmes, G. and Pfahringer, B., "Accurate Ensembles for Data Streams: Combining Restricted Hoeffding Trees Using Stacking," Proc. 2nd Asian Conference on Machine Learning, Tokyo, Journal of Machine Learning Research.
[16] Bifet, A., Frank, E., Holmes, G. and Pfahringer, B., "Ensembles of restricted Hoeffding trees," ACM Transactions on Intelligent Systems and Technology (TIST), Vol. 3, Issue 2, pp. 1-20, 2012, ACM.
[17] Domingos, P. and Hulten, G., "Mining high-speed data streams," in KDD '00, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71-80, 2000, NY, USA, ACM Press.
[18] Han, J. and Kamber, M., "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2007, San Francisco, CA.
[19] Street, W.N. and Kim, Y., "A streaming ensemble algorithm (SEA) for large-scale classification," Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2001, New York, USA, ACM Press.
[20] Schlimmer, J.C. and Granger, R.H., "Incremental learning from noisy data," International Conference on Machine Learning, 1(3), 1986.
[21] Aha, D., UCI Machine Learning Repository.
[22] Hulten, G., Spencer, L. and Domingos, P., "Mining time-changing data streams," in KDD, 2001.
[23] Agarwal, R., Ghosh, S.P., Imielinski, T., Iyer, B.R. and Swami, A.N., "An interval classifier for database mining applications," International Conference on Very Large Data Bases, 1992.
