SCENARIO-BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

P. Radhabai (PG Student), Mrs. M. Priya Packialatha (Assistant Professor), Dr. G. Geetha (Professor)
Dept of Computer Science and Engg, Jerusalem College of Engineering, Anna University Chennai, Chennai, India
radhabai.p@gmail.com, shefali210600@yahoo.com, togeethamohan@gmail.com

ABSTRACT - In a real-time environment, stream data contains noise, missing values, and redundant features. This calls for a mechanism that adapts both the preprocessing and the prediction steps to the current scenario. Many learning approaches currently available adapt to changes in data: if the data evolves over time, the algorithms should adapt to the changing environment. Automating the predictor together with its preprocessing is a very difficult task. Many existing models adapt the preprocessor and the predictor separately, but such models do not predict accurately. In this paper, we propose a new scenario-based decoupling process for adaptive preprocessing and prediction. The method uses an SVM classifier to classify the stream data and applies adaptive preprocessing and prediction with improved accuracy.

Index terms - Data evolution, Adaptive preprocessing, Support vector machine, Incremental learning.

I. INTRODUCTION

The data mining process extracts knowledge from existing data and transforms it into a human-understandable structure for further use. It involves database and data management aspects, data preprocessing and modelling. KDD has three stages. The first stage is data preprocessing, which entails data collection, data smoothing, data cleaning, data transformation and data reduction. The second stage, normally called Data Mining (DM), involves data modelling and prediction.
The third stage is data postprocessing: the interpretation, conclusions, or inferences drawn from the analysis in the second stage. Data present in the real world is incomplete (lacking attribute values), noisy (containing errors or outliers) and inconsistent; for this reason preprocessing is needed. Adaptive preprocessing means that when there is a shift in the data, the classification or prediction models need to adapt to the changes. The preprocessing component in an adaptive prediction system has two main connections, as illustrated in Fig. 1. First, the preprocessor may need feedback from the predictor to decide upon adapting or retraining itself. Second, the preprocessor produces a mapping that transforms the raw data, which is then used by the predictor. Thus, when deciding whether to decouple the adaptivity of preprocessing from the adaptivity of the predictor, the consistency of these two links needs to be assessed and handled.

Fig. 1. Preprocessing and prediction in an adaptive system

Data stream classification is challenging because of the many practical aspects of efficient processing and the temporal behaviour of the stream. The dynamic and evolving nature of data streams poses special challenges to the development of effective and efficient algorithms [1], [2]. Two of the most challenging characteristics of data streams are their infinite length and concept drift. Concept drift occurs when the underlying concepts of the stream change over time. Many methods have been implemented for the preprocessing of data [3]. One way to automate preprocessing in adaptive learning is to keep preprocessing tied to the adaptive predictor, which can be done in two ways. The first option is to keep the preprocessing fixed for the lifetime of the model, so that only the predictor adapts over time. The second option requires the retraining of the preprocessor and the predictor to be synchronized.
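Deciding when the preprocessor or predictor should be retrained is typically driven by monitoring the predictor's performance over the stream. The following is a minimal sketch of such a monitor (all names are illustrative and not from the paper's implementation): it flags suspected concept drift when the running error rate rises well above the best rate seen so far.

```java
// Minimal error-rate drift monitor (illustrative sketch, not the paper's code).
// A sustained rise of the running error rate above the best rate observed so
// far is treated as a sign of concept drift.
public class DriftMonitor {
    private double errors = 0, seen = 0;
    private double bestRate = Double.MAX_VALUE;
    private final double tolerance;  // allowed slack above the best error rate

    public DriftMonitor(double tolerance) { this.tolerance = tolerance; }

    /** Feed one prediction outcome; returns true if drift is suspected. */
    public boolean record(boolean correct) {
        seen++;
        if (!correct) errors++;
        double rate = errors / seen;
        if (rate < bestRate) bestRate = rate;
        // require a warm-up period before raising an alarm
        return seen > 30 && rate > bestRate + tolerance;
    }
}
```

A signal from such a monitor could then trigger retraining of the preprocessor, the predictor, or both, depending on the scenario.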
To present meaningful scenarios of adaptive preprocessing, we need to characterize adaptive learning approaches [4]. These approaches describe the mechanisms behind adaptive predictors, but they can be directly translated for application to adaptive preprocessors. Naive Bayes is a simple technique for constructing classifiers which assumes that the value of a particular feature is independent of the value of any other feature. The aim of the Support Vector Machine (SVM) is to find the best classification function to distinguish between members of the
two classes in the training data. SVM insists on finding the maximum-margin hyperplane because it offers the best generalization ability: it allows not only the best classification performance (e.g., accuracy) on the training data, but also leaves room for the correct classification of future data.

II. RELATED WORKS

Adaptive preprocessing when learning from evolving streaming data is one open issue; another is synchronizing multiple adaptive components in one online learning system when the components adapt at different phases. Several studies address the problem of an adaptive feature space. Several works originating from different research groups relate to classifying textual streams [5], [6], [7]. Learning from textual data online requires an adaptive feature space, so these works study how to incorporate new features incrementally, which is straightforward for classifiers that deal with individual attributes separately. Another series of works [8], [9] considers dynamic feature selection in data streams, specifically for regression problems. These works relate via the changing-environment and dynamic-feature-selection keywords; however, their setting is different. They can be considered as active learning in attribute space, where the approaches actively select which attributes to observe next. Adaptive preprocessing has also been addressed in stationary online learning [10] for another specific problem, namely normalization of the input variables of a neural network so that they fall into the range [-1, 1]. That approach links the scaling of features with the scaling of weights. In this case, however, the preprocessor is not adaptive; the study rather investigates the environment in which the neural network itself, as a predictor, can or cannot be adaptive.

III. PROPOSED WORK

The proposed system uses the SVM model, and new scenarios are implemented for the preprocessor and predictor under certain circumstances to predict the final accuracy and classify the stream data in an efficient manner. The existing system uses a Naive Bayes (NB) classifier for adaptive preprocessing, which rests on a probabilistic independence assumption and yields lower accuracy. The advantages of the proposed work are that it efficiently monitors and detects the adaptivity of the preprocessor and predictor, improves the prediction accuracy, and efficiently handles the overtime problem. The overall architecture is shown in Fig. 2 and the detailed architecture in Fig. 2.1. Concept drift occurs in the stream when the underlying concept of the data changes over time, so the classification model must be updated continuously to reflect the most recent concept. Changes in data distribution can be described as concept drift, data evolution, or both. Three scenarios are shown in Fig. 2; the first and second scenarios need to adapt only the preprocessor, while the third scenario needs to adapt both the preprocessor and the predictor.

Scenario 1: Data evolution without decision boundary.
Scenario 2: Data evolution with decision boundary.
Scenario 3: Incremental learning.

Fig 2: Architecture of scenario-based process
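The three scenarios can be read as a simple dispatch rule: scenarios 1 and 2 retrain only the preprocessor, while scenario 3 retrains both components. A sketch (illustrative, not the paper's implementation):

```java
// Scenario dispatch sketch: which components to adapt per scenario.
// Scenarios 1 and 2 adapt only the preprocessor; scenario 3 adapts both.
public class ScenarioDispatch {
    public enum Scenario { EVOLUTION_NO_BOUNDARY, EVOLUTION_WITH_BOUNDARY, INCREMENTAL }

    /** Returns the adaptation plan: [adaptPreprocessor, adaptPredictor]. */
    public static boolean[] adaptPlan(Scenario s) {
        switch (s) {
            case EVOLUTION_NO_BOUNDARY:
            case EVOLUTION_WITH_BOUNDARY:
                return new boolean[] { true, false };  // preprocessor only
            case INCREMENTAL:
            default:
                return new boolean[] { true, true };   // both components
        }
    }
}
```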
Fig 2.1: Detailed architecture of SVM classifier

A. Data Preparation and Stream Generation

The dataset is taken from a file downloaded from the web and converted into table form for further processing. The continuous dataset consists of attributes such as index counter, date of acquisition, outside temperature, outside humidity and barometric pressure. The stream data is generated with the Java thread concept using a Rand() function.

B. Data Evolution

We need to handle the data evolution situation (changes in data distribution). Initially, we prepare the discrete dataset, which consists of attributes such as temperature, humidity and pressure. If incoming data contains missing or null values, these are replaced by the mean method before the data is sent to the predictor; here there is a need to adapt the preprocessor. Outliers are removed and null values in the dataset are replaced during preprocessing. The min-max normalization technique is used for preprocessing and the SVM classifier is used for prediction. The two scenarios are 1) data evolution without decision boundary and 2) new data evolution with decision boundary.

C. Incremental Learning

Incremental learning approaches [11] can increment at an instance level, at a batch level or at an ensemble level. At a batch level, the parameters of the model can only be updated after a number of incoming data points have been seen; for instance, more than one new data point may be needed for estimating the current accuracy. Such an approach is considered incremental, because the old model is not discarded and relearned from scratch, but only updated. Here there is a need to adapt both the preprocessor and the predictor. New conditions are added with new data in two ways: 1) Instance method - a new attribute value is added to the database. 2) Batch method - a batch of untrained data (not already present in the database) is added to the database.

D. SVM Training and SVM Testing
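The mean method mentioned above for replacing missing or null values can be sketched as follows (the column layout is illustrative; missing entries are encoded as Double.NaN, which is an assumption of this sketch, not a detail from the paper):

```java
import java.util.Arrays;

// Mean-method imputation sketch for missing/null values in one attribute
// column. Missing entries are encoded as Double.NaN and replaced by the
// mean of the observed (non-missing) values.
public class MeanImputer {
    public static double[] impute(double[] column) {
        double sum = 0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) { sum += v; count++; }
        }
        double mean = count > 0 ? sum / count : 0.0;  // all-missing column -> 0
        double[] out = Arrays.copyOf(column, column.length);
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) out[i] = mean;
        }
        return out;
    }
}
```

For example, a temperature column {1.0, NaN, 3.0} would become {1.0, 2.0, 3.0}.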
In the training phase, a set of sample data is collected, normalized, and then trained:
1) Sample data is converted into machine-learning data using the SVMsearchtrain() method.
2) Iterations are performed until the error rate is sufficiently low.
In the testing phase, a set of input data is collected and tested:
1) The Makeparse() method converts the input data into machine-learning data.
2) The Classify() method classifies the data using the SVM_predict() method.

SVM steps:
Training: the kernel function (RBF) separates the data by a hyperplane.
Tuning: tuning the kernel means retraining the SVM classifier.
Testing: new data is classified using the predict method.

ALGORITHM: Support Vector Machine
1) Map the original input space to some higher-dimensional feature space (Φ: x → φ(x)).
2) Choose a kernel function, e.g. the Radial-Basis Function (RBF) kernel:
   K(x_i, x_j) = exp(-‖x_i - x_j‖² / (2σ²))
3) Solve the quadratic programming problem:
   maximize Σ_{i=1..n} α_i - (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j K(x_i, x_j)
4) Construct the discriminant function from the support vectors:
   g(x) = Σ_{i ∈ SV} α_i y_i K(x_i, x) + b
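The kernel and discriminant function of the algorithm can be sketched directly in code. This is a minimal sketch of the prediction side only: the support vectors, multipliers α_i, labels y_i and bias b are assumed to come from an already trained model (no quadratic-programming training is performed here, and the class/method names are illustrative).

```java
// RBF kernel and SVM decision function, following steps 2 and 4 of the
// algorithm above. Training (step 3) is assumed to have been done elsewhere.
public class SvmDecision {
    /** K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2)) */
    public static double rbf(double[] xi, double[] xj, double sigma) {
        double sq = 0;
        for (int k = 0; k < xi.length; k++) {
            double d = xi[k] - xj[k];
            sq += d * d;
        }
        return Math.exp(-sq / (2 * sigma * sigma));
    }

    /** g(x) = sum over support vectors of alpha_i * y_i * K(x_i, x) + b */
    public static double decision(double[][] sv, double[] alpha, int[] y,
                                  double b, double sigma, double[] x) {
        double g = b;
        for (int i = 0; i < sv.length; i++) {
            g += alpha[i] * y[i] * rbf(sv[i], x, sigma);
        }
        return g;  // the sign of g gives the predicted class
    }
}
```

Note that K(x, x) = 1 for any x, since the squared distance of a point to itself is zero.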
Preprocessing techniques:
1) The min-max method is used for normalization.
2) The mean method is used for replacing missing values.

Dataset: the input dataset gives a brief overview of the attributes of the weather data, which consists of attributes such as index counter, date of acquisition, outside temperature, outside humidity and barometric pressure.

IV. PERFORMANCE RESULTS

An analysis of classifier accuracy shows that the support vector machine is a more accurate classifier than the NB classifier, as shown in Fig. 3.

Fig: 3 Accuracy vs number of data

V. IMPLEMENTATION

The preprocessing technique needs to be applied before training the data; here the min-max normalization method is used. Preprocessing is shown in Fig. 4.

Fig: 4 Preprocessing

A set of sample data is collected and trained in the SVM training phase. First the data is normalized to the range -1 to +1, and the results are stored in the training phase using a probability function. The training dataset is represented in Fig. 5 and Fig. 5.1.

Fig: 5.1 Training data
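The min-max normalization to the range -1 to +1 described above can be sketched as follows (a minimal per-column sketch; the handling of a constant column, mapped to 0 to avoid division by zero, is an assumption of this sketch):

```java
// Min-max normalization of one attribute column to [-1, +1]:
//   x' = 2 * (x - min) / (max - min) - 1
public class MinMaxNormalizer {
    public static double[] normalize(double[] column) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : column) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            // a constant column is mapped to 0 to avoid division by zero
            out[i] = (max == min) ? 0.0
                   : 2.0 * (column[i] - min) / (max - min) - 1.0;
        }
        return out;
    }
}
```

For example, a pressure column {0, 5, 10} maps to {-1, 0, +1}.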
Fig: 5 SVM Training

The dataset is taken from a file downloaded from the web. The continuous dataset consists of attributes such as index counter, date of acquisition, outside temperature, outside humidity and barometric pressure. The stream data generated with the Java thread concept using a Rand() function is shown in Fig. 6.

Fig: 6 Stream Generation

The input data is classified against the trained data using the distributed density function: if the result lies in the range 0 to 1 the data is considered valid, and if it lies in the range 0 to -1 the data is invalid. The testing result is represented in Fig. 7.

Fig: 7 SVM Testing

VI. CONCLUSION

The dynamic and evolving nature of data streams poses special challenges to the development of effective and efficient algorithms. Two of the most challenging characteristics of data streams are their infinite length and concept drift. In this paper we have introduced a scenario-based decoupling process for adaptive preprocessing and prediction. The stream data is classified using the SVM model, the decoupling process is done efficiently, and scenarios are created based on overtime conditions and changes in the environment. It predicts with higher accuracy than other methods. For future work, predicting and preprocessing automatically with respect to changes in time and environment could be explored using Big Data concepts.

REFERENCES

[1] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavalda, "New Ensemble Methods for Evolving Data Streams," Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '09), pp. 139-148, 2009.
[2] E. Ikonomovska, J. Gama, and S. Dzeroski, "Learning Model Trees from Evolving Data Streams," Data Mining and Knowledge Discovery, vol. 23, no. 1, pp. 128-168, 2011.
[3] M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham, "Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints," IEEE Trans. Knowledge and Data Eng., vol. 23, no. 6, pp.
859-874, June 2011.
[4] G. Widmer and M. Kubat, "Learning in the Presence of Concept Drift and Hidden Contexts," Machine Learning, vol. 23, pp. 69-101, 1996.
[5] M.M.Q. Chen, J. Gao, L. Khan, J. Han, and B. Thuraisingham, "Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space," Proc. European Conf. Machine Learning and Knowledge Discovery in Databases: Part II (ECML PKDD '10), pp. 337-352, 2010.
[6] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Dynamic Feature Space and Incremental Feature Selection for the Classification of Textual Data Streams," Proc. ECML/PKDD '06 Int'l Workshop Knowledge Discovery from Data Streams, pp. 107-116, 2006.
[7] B. Wenerstrom and C. Giraud-Carrier, "Temporal Data Mining in Dynamic Feature Spaces," Proc. Sixth Int'l Conf. Data Mining (ICDM '06), pp. 1141-1145, 2006.
[8] C. Anagnostopoulos, D. Tasoulis, D. Hand, and N. Adams, "Online Optimization for Variable Selection in Data Streams," Proc. 18th European Conf. Artificial Intelligence (ECAI '08), pp. 132-136, 2008.
[9] C. Anagnostopoulos, N. Adams, and D. Hand, "Deciding What to Observe Next: Adaptive Variable Selection for Regression in Multivariate Data Streams," Proc. ACM Symp. Applied Computing (SAC '08), pp. 961-965, 2008.
[10] H. Ruda, "Adaptive Preprocessing for On-line Learning with Adaptive Resonance Theory (ART) Networks," Proc. IEEE Workshop Neural Networks for Signal Processing (NNSP), 1995.
[11] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Dynamic Feature Space and Incremental Feature Selection for the Classification of Textual Data Streams," Proc. ECML/PKDD '06 Int'l Workshop Knowledge Discovery from Data Streams, pp. 107-116, 2006.