ONLINE ALGORITHMS FOR HANDLING DATA STREAMS


Seminar I
Luka Stopar
Supervisor: prof. dr. Dunja Mladenić
Approved by the supervisor: (signature)
Study programme: Information and Communication Technologies (ICT3). Doctoral degree...
Ljubljana, 2014


Abstract

In recent years the problem of mining data streams has gained much attention in the data mining community. A data stream is a continuous, time-changing flow of data. Examples of data streams include sensor networks, Twitter feeds, news articles, query logs and ATM transactions. As data streams are potentially infinite, they cannot simply be stored in memory and mined from there. Furthermore, their distribution can change over time, potentially rendering old data useless or even harmful. This phenomenon is known as concept drift. Algorithms for mining data streams must process data online, updating their model in real time. When concept drift occurs they must be able to adapt their model to the change. Many such methods have been proposed in the scientific literature, ranging from decision trees to artificial neural networks. This paper provides an overview of the growing field of mining data streams. We review several methods and frameworks developed specifically for mining data streams. Furthermore, we give a critical judgment of the existing research and present some further research opportunities.

Keywords: data mining, stream mining, online algorithms, concept drift, regression, classification


Contents

1 Introduction
2 Problem Definition
2.1 Data Streams
2.2 Data Stream Mining
3 Related Work
3.1 Decision and Regression Trees for Stream Mining
3.2 Naïve Bayes for Stream Mining
3.3 Support Vector Machines for Stream Mining
3.4 Large Margin Classifiers for Stream Mining
3.5 Local Linear Models for Stream Mining
3.6 Artificial Neural Networks for Stream Mining
3.7 K-Means for Stream Mining
3.8 Online Hierarchical Clustering for Stream Mining
3.9 Ensembles for Stream Mining
3.10 Algorithm Output Granularity
4 Critical Judgment
5 Further Work
6 References


1 Introduction

In recent years, progress in technology, especially hardware, has made it possible for organizations to record and store massive data sets. Such data sets include web logs, Twitter feeds, news articles, phone conversations, ATM transactions and sensor network observations. They are characterized by their volume, velocity and variety, and are often called Big Data. Big Data is a collection of data sets so vast that they are difficult to process using traditional tools and techniques. The challenges of dealing with Big Data include storage, querying, sharing, visualization and analysis. As companies begin implementing Big Data solutions, many opportunities present themselves to the research community.

In many Big Data applications data arrives sequentially and continuously, and the system has no influence on the order and arrival times. Furthermore, as time passes, the distribution generating the data may change, making models inaccurate and often obsolete, a phenomenon known as concept drift. Such data sets are called data streams and require traditional methods to be adjusted or redesigned to cope with these constraints.

Traditional data mining techniques usually assume that the data comes from a static source and can be stored in memory before processing. Many of them require multiple passes over the data set in order to build a static model, which they then apply to previously unseen instances. When the data set is a potentially infinite data stream, it becomes technically infeasible to load it into memory and operate on it from there. Traditional data mining techniques have to be redesigned to process the data online and update their model in real time. They can store only a small sample of the stream; the rest must be summarized and forgotten. Furthermore, they have to detect when concept drift occurs and adapt their model so that it does not become obsolete.

The remainder of this paper is structured as follows. In section 2 we provide a formal definition of the problem of mining data streams. In section 3 we survey the related work in the research field. In section 4 we give a critical judgment of the related work, and finally in section 5 we present some research opportunities and directions for further work.

2 Problem Definition

2.1 Data Streams

A data stream can be modelled as a sequence of data instances arriving continuously and sequentially in real time. The system has no control over the order or the frequency in which they arrive. Because the stream is potentially unbounded, the system can either discard or store an instance once it has been processed, but it can only store a small fraction of the entire data set; the rest must be forgotten.

More formally, a data stream is a pair (s, τ), where s is a sequence of instances and τ is the corresponding sequence of real-time intervals. The elements of s are generated by a data source O according to a distribution D which may, or may not, change over time. The major constraints in this model are:

Volume: the length of the sequence is potentially unbounded, so it is impossible to store it. Only a small summary can be computed and stored, and the rest of the information must be discarded.
Velocity: the frequency of arrival is potentially very high and non-constant, so each instance must essentially be processed in real time.
Variety: the distribution generating the sequence may change over time, so past data may become irrelevant or even harmful.

2.2 Data Stream Mining

Data mining is often defined as a set of methods that allow for pattern detection under uncertain conditions. Machine learning provides the technical basis of data mining. It is used to extract information from the raw data, information that is expressed in a comprehensible form and can be used for a variety of purposes [1]. Machine learning methods include traditional algorithms, such as k-means and decision trees, and statistical algorithms, such as Support Vector Machines, Artificial Neural Networks and Local Linear Models. These methods assume that the data is sampled from a stationary distribution and stored in memory before processing.
Most require multiple passes over the data and build a static model which is then used for pattern detection and prediction. Their goal is to summarize the data as simply, usefully and elegantly as possible. This summarization can be a mean function, a prediction or a count of how many times a certain event occurred.

In the data stream model it is not technically feasible to simply save the incoming instances into a database and operate on them from there. Furthermore, because the distribution generating the instances can change, the algorithms need to be able to adapt. Under the constraints presented above, the main properties of an ideal model become high accuracy, fast adaptation, and low time and space complexity. Furthermore, the model must operate online and be able to offer a prediction at any time.

Some basic data stream mining techniques include sampling, sketching, load shedding, synopsis data structures, aggregation, wavelets and sliding windows. We discuss them briefly here:

Sampling is a technique for reducing the rate at which instances are processed. It makes a probabilistic choice of whether an instance will be processed or not. It is used as a universal method to reduce the running time of computations, as it allows the computation to be performed on a much smaller data set and the result scaled to compensate for the difference in size. Bounds on the error rate can be computed as a function of the sampling rate.

Sketching is the process of projecting the domain of the instances onto a significantly smaller domain using random functions. As with sampling, error bounds can be computed [2].

Load shedding refers to the process of dropping a fraction of the data stream during periods of system overload. It is desirable to shed the load in a way that minimizes the drop in accuracy.

Synopsis data structures hold summary information about the data stream. They embody the idea of small memory complexity and approximate solutions to massive data set problems. The complexity of constructing these structures cannot be more than O(n), but solutions which give results closer to O(log(n)) are needed.

Aggregation is a technique for summarizing the incoming data stream. Aggregation functions include mean, variance, maximum and minimum. This technique offers vast memory savings, but can fail if the stream fluctuates strongly.

Wavelets are a mathematical technique for representing signals as a weighted sum of simpler, fixed building waveforms at different scales and positions. They attempt to capture trends in numerical functions, decomposing a signal into a set of coefficients [3].

Sliding windows are a technique where old instances are removed and replaced by new ones. The two types of sliding window are count-based and time-based. A count-based sliding window stores the N most recent elements, while a time-based sliding window stores all the elements that arrived in the latest N units of time.
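The two sliding-window variants described above can be sketched as follows. This is a minimal illustration rather than an implementation from the surveyed literature; the class names `CountWindow` and `TimeWindow` are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps the N most recent items."""

    def __init__(self, size):
        # A deque with maxlen drops the oldest item automatically.
        self.items = deque(maxlen=size)

    def add(self, item):
        self.items.append(item)


class TimeWindow:
    """Time-based window: keeps items from the latest `span` time units."""

    def __init__(self, span):
        self.span = span
        self.items = deque()  # (timestamp, item) pairs, oldest first

    def add(self, timestamp, item):
        # Assumption: timestamps are non-decreasing.
        self.items.append((timestamp, item))
        # Evict everything that is at least `span` time units old.
        while self.items and self.items[0][0] <= timestamp - self.span:
            self.items.popleft()

    def values(self):
        return [item for _, item in self.items]


count_window = CountWindow(3)
for value in range(5):
    count_window.add(value)

time_window = TimeWindow(10)
time_window.add(0, "a")
time_window.add(5, "b")
time_window.add(12, "c")
```

The count-based window has a fixed memory footprint, whereas the memory used by the time-based window depends on the arrival rate of the stream.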
Other stream mining methods include algorithms developed specifically for processing data streams. These are usually traditional data mining algorithms modified to cope with the constraints presented above. They process instances sequentially, using only a limited amount of memory and update their model before a new instance arrives, usually producing approximate results. We will discuss some of these methods in the next section.
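As a small illustration of one-pass processing with constant memory, the aggregation functions mentioned earlier in this section (mean, variance, minimum, maximum) can be maintained incrementally. The sketch below uses Welford's update for the variance, which is our choice for the example and is not taken from the surveyed papers; the class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """One-pass aggregates over a stream using O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        # Welford's update: numerically stable, no instance is stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 for fewer than two instances.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

Each instance is seen exactly once and then discarded, which is the access pattern the data stream model requires.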

3 Related Work

Since the beginning of the new century, mining data streams has gained much attention in the data mining community. The requirement to process fast data streams has motivated the need for approximation algorithms that use small amounts of space and time. In this section we give an overview of the different methods and algorithms developed for mining data streams.

3.1 Decision and Regression Trees for Stream Mining

Domingos and Hulten [4] proposed a general approach for scaling up machine learning algorithms called Very Fast Machine Learning (VFML). Their approach requires the user to define a loss function L which penalizes the difference between the model and a hypothetical model built on infinitely many instances. The user then has to bound the loss function as a function of the number of instances the algorithm uses in each step. From this bound the user can compute the number of instances per step that minimizes the running time while respecting the bound on the loss.

In their later work [5] [6] they applied this methodology to mining data streams. They proposed a decision tree induction algorithm called Very Fast Decision Tree (VFDT) and later Concept-adapting Very Fast Decision Tree (CVFDT). When inducing a decision tree, the VFDT algorithm uses only a small sample of the available data instances when choosing the split attribute. This makes it suitable for processing data streams where the whole data set is not available or is too large to store in memory. To determine the number of examples needed to split a node, the algorithm uses the Hoeffding bound (also known as the additive Chernoff bound), which guarantees that, with a certain confidence, the attribute it has chosen is the correct one. When system memory becomes low, VFDT reduces its memory requirements by temporarily deactivating learning in the unpromising nodes.
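The split test described above can be illustrated with the standard form of the Hoeffding bound: after n independent observations of a variable with range R, the true mean lies within ε = sqrt(R² ln(1/δ) / (2n)) of the observed mean with probability at least 1 - δ. The sketch below is a minimal illustration of that test, not the VFDT implementation; the function names and the example gain values are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the observed mean of n samples is within epsilon
    of the true mean with probability at least 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best attribute leads the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical gains for the two best attributes at a leaf.
# For information gain with two classes the range is log2(2) = 1.
epsilon = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
ready_after_1000 = can_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
ready_after_100 = can_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100)
```

Because ε shrinks as n grows, a leaf that cannot yet be split simply waits for more instances; with the gains assumed here the split is accepted after 1000 instances but not after 100.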
CVFDT is an extension of VFDT which, in addition to inducing the decision tree incrementally, allows the underlying decision tree to adapt when concept drift occurs. When this happens, some attributes that previously passed the Hoeffding bound will no longer do so. In this case CVFDT starts to grow an alternative sub-tree and replaces the node when the new sub-tree becomes more accurate.

Since then, online decision trees have gained much attention in the stream mining community. Rutkowski et al. [7] give a mathematical justification for online decision trees and present an algorithm they call Gaussian Decision Tree (GDT) induction, in which they propose a statistical test for determining the best attribute on which to split a node. Ikonomovska, Gama and Džeroski [8] proposed an online regression and model tree induction algorithm which is able to incrementally build and maintain a model tree.

3.2 Naïve Bayes for Stream Mining

Godec et al. [9] used an online version of the Naïve Bayes (NB) classifier for multi-label classification and visual tracking. To adapt the classifier to mining data streams they approximated the probability distributions with equally binned histograms, which were updated as new data arrived. To cope with concept drift they proposed exponential moving average filtering for each bin, where the exponential decay is an input parameter.
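The idea of histogram-based Naïve Bayes with exponential moving average filtering can be sketched as follows. This is a simplified illustration under our own assumptions (a fixed feature range [lo, hi], equal-width bins, a single decay parameter shared by all bins and by the class priors); it is not the algorithm of [9], and the class name `OnlineHistogramNB` is hypothetical.

```python
import math


class OnlineHistogramNB:
    """Naive Bayes with per-class, per-feature histograms that are
    updated by an exponential moving average (EMA)."""

    def __init__(self, n_features, classes, lo=0.0, hi=1.0, n_bins=10, decay=0.05):
        self.classes = list(classes)
        self.lo, self.hi, self.n_bins, self.decay = lo, hi, n_bins, decay
        uniform = 1.0 / n_bins  # start from a uniform distribution
        self.hist = {c: [[uniform] * n_bins for _ in range(n_features)]
                     for c in self.classes}
        self.prior = {c: 1.0 / len(self.classes) for c in self.classes}

    def _bin(self, value):
        frac = (value - self.lo) / (self.hi - self.lo)
        return min(self.n_bins - 1, max(0, int(frac * self.n_bins)))

    def update(self, x, y):
        a = self.decay
        # EMA on class priors: old evidence decays, the new label adds mass.
        for c in self.classes:
            self.prior[c] = (1 - a) * self.prior[c] + (a if c == y else 0.0)
        # EMA on every bin of the observed class; each histogram still sums to 1.
        for j, value in enumerate(x):
            bins, hit = self.hist[y][j], self._bin(value)
            for b in range(self.n_bins):
                bins[b] = (1 - a) * bins[b] + (a if b == hit else 0.0)

    def predict(self, x):
        def log_score(c):
            score = math.log(self.prior[c] + 1e-12)
            for j, value in enumerate(x):
                score += math.log(self.hist[c][j][self._bin(value)] + 1e-12)
            return score
        return max(self.classes, key=log_score)


model = OnlineHistogramNB(n_features=1, classes=["a", "b"])
for _ in range(50):                 # initial concept
    model.update([0.15], "a")
    model.update([0.85], "b")
before_drift = model.predict([0.15])
for _ in range(100):                # concept drift: the labels swap
    model.update([0.15], "b")
    model.update([0.85], "a")
after_drift = model.predict([0.15])
```

A larger decay makes the model forget the old concept faster, at the cost of noisier probability estimates while the stream is stationary.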

3.3 Support Vector Machines for Stream Mining

Cauwenberghs and Poggio [10] introduced an incremental and decremental algorithm for learning the Support Vector Machine (SVM) classifier. The algorithm splits the instances into three sets S, E and R, where S contains all vectors strictly on the margin, E contains vectors exceeding the margin and R contains the (ignored) vectors within the margin. To satisfy the Karush-Kuhn-Tucker (KKT) conditions when a new instance arrives, the coefficients α_i are updated for all the vectors in the three sets, and some vectors change sets. They show that the algorithm computes the same solution as the batch SVM version. The asymptotic running time of the incremental algorithm is O(n), where n is the number of instances, which makes it infeasible to run the algorithm on infinitely many instances. However, because of the decremental algorithm, the SVM model can be maintained on a sliding window. Furthermore, the system can select useless instances which it wishes to unlearn and free memory for the useful instances.

Diehl and Cauwenberghs [11] later extended this work by proposing an algorithm which is able to adapt the current model to changes in kernel and regularization parameters. Ma, Theiler and Perkins [12] presented a regression counterpart to the incremental SVM classifier by following the approach of Cauwenberghs and Poggio. They call their algorithm Accurate On-line Support Vector Regression (AOSVR).

3.4 Large Margin Classifiers for Stream Mining

Harrington et al. [13] propose an algorithm for learning large margin classifiers. The algorithm is an averaged perceptron-like algorithm which achieves diversity in its models using a bagging-like technique. When a new instance arrives it is given to perceptron j with probability p, where p is a system parameter which controls the diversity of the perceptrons. The perceptrons are all independent and can be trained in parallel.
One of the disadvantages of this algorithm is the assumption that the instances can be linearly separated. This can be overcome using the kernel trick, but at the cost of higher time and space complexity.

3.5 Local Linear Models for Stream Mining

Vijayakumar et al. [14] present an algorithm for incremental non-linear function approximation in high dimensional spaces. The algorithm assumes that the high dimensional space has locally low dimensional distributions, so only a small part of it needs to be filled with linear models. It incrementally learns the number of linear models, their coefficients and their region of validity, parameterized in a Gaussian kernel by a distance metric. To update the local models the algorithm uses partial least squares to compute orthogonal projections of the input space and performs linear regression on the projected directions. Prediction involves computing a weighted sum of the predictions of all the local models.

3.6 Artificial Neural Networks for Stream Mining

Artificial Neural Networks (ANN) [3] are a general function approximation method. A three-layer ANN can approximate any continuous function with arbitrary precision. In the traditional setting ANNs make several iterations over the training data, which makes them slow and often makes them overfit the data. In the stream mining setting, however, the vast number of available instances means that each instance is processed only once. When the label becomes available, the error is back-propagated through the network and the weights are adjusted. Thus the network is able to process infinitely many instances and can update in real time.

3.7 K-Means for Stream Mining

Clustering is perhaps the most frequently used technique in exploratory data analysis. Domingos and Hulten [4] proposed an approximation algorithm for the k-means problem called Very Fast K-Means (VFKM), following the VFML approach. VFKM uses the Hoeffding bound to determine the number of instances needed in each iteration of the k-means algorithm. It runs a sequence of iterations, each executed on an increasing number of instances. The algorithm terminates when a statistical bound is reached.

McCutchen and Khuller [15] proposed a constant-factor approximation algorithm for the k-center problem. The k-center problem differs from k-means in that it minimizes the maximum distance from any instance to its center. To make the method robust to outliers, the algorithm accumulates incoming data points until enough of them are close together before forming a new cluster. Once a cluster is created, any new points that fall into the cluster are forgotten. The algorithm thus holds a maximum of O(kz) points in memory, where z is the maximum number of outliers and k is the number of clusters. Once the outliers are accumulated, it runs a 4-approximation offline algorithm for k-center clustering with outliers to attempt to cover the points with the remaining clusters. If the number of clusters ever becomes more than k, the algorithm has a mechanism to drop clusters and their support points.

3.8 Online Hierarchical Clustering for Stream Mining

Rodrigues et al. [16] proposed an online hierarchical clustering algorithm for clustering data streams.
They called the algorithm Online Divisive-Agglomerative Clustering (ODAC). It builds and maintains a tree-like hierarchical structure of clusters using a top-down approach. To split the nodes the algorithm continuously monitors the diameter of each leaf. It does this by computing Pearson's correlation coefficient between all the variables in the node and taking the minimum correlation. When a certain condition on the diameter holds, it splits the node, assigning each variable to one of the children. To determine the minimum number of instances needed to split a node, the system uses the Hoeffding bound, which guarantees that, with a certain confidence level, the split was correct. To cope with concept drift the algorithm continuously monitors and compares the diameter of each parent node to the diameters of its children. When a child's diameter becomes, with a certain confidence level, larger than its parent's, the system resets the parent node.

3.9 Ensembles for Stream Mining

When dealing with data streams, ensembles of models offer several advantages over single-model methods. Ensembles are combinations of several models whose individual predictions are combined in some manner to form a final prediction. They are easy to scale and parallelize, and can handle concept drift by pruning bad parts of the ensemble.
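The general idea of combining predictions and pruning poor members can be sketched with a weighted vote. This is a generic illustration and not one of the surveyed methods; the class name `WeightedVoteEnsemble`, the penalty factor `beta` and the pruning threshold `min_weight` are hypothetical, and the members are fixed prediction functions rather than online learners to keep the example short.

```python
from collections import defaultdict


class WeightedVoteEnsemble:
    """Weighted vote over members; poorly performing members are pruned."""

    def __init__(self, members, beta=0.5, min_weight=0.01):
        # Each entry is [predict_fn, weight]; all members start with weight 1.
        self.members = [[predict_fn, 1.0] for predict_fn in members]
        self.beta = beta
        self.min_weight = min_weight

    def predict(self, x):
        votes = defaultdict(float)
        for predict_fn, weight in self.members:
            votes[predict_fn(x)] += weight
        return max(votes, key=votes.get)

    def update(self, x, y):
        # Penalize every member that mispredicts the revealed label.
        for member in self.members:
            if member[0](x) != y:
                member[1] *= self.beta
        # Prune weak members, but never empty the ensemble.
        kept = [m for m in self.members if m[1] >= self.min_weight]
        if kept:
            self.members = kept


# The true concept is "x > 0"; only the first member matches it.
ensemble = WeightedVoteEnsemble([lambda x: x > 0, lambda x: x > 5, lambda x: False])
initial_vote = ensemble.predict(3)
for _ in range(2):
    ensemble.update(3, True)
vote_after_two_updates = ensemble.predict(3)
for _ in range(5):
    ensemble.update(3, True)
members_left = len(ensemble.members)
```

In this run the two wrong members outvote the correct one at first, lose the vote after two updates, and are pruned after seven.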

Park et al. [17] propose a framework for distributed multimedia stream mining systems. Their framework is a set of classifiers organized in a tree topology. The inner classifiers in the tree are used as filters, filtering data based on a semantic hierarchy of concepts, while the leaves classify the actual class of interest. Each classifier can have access to the information of the classifiers in its sub-tree and can form a so-called coalition. It can then select a strategy that maximizes the utility of the entire coalition. They used the framework to identify semantic concepts in a stream of sports images, where they set up a hierarchy of SVM classifiers trained on low-level features, such as color histograms, and compared its performance to a centralized approach in which the problem is modeled as an optimization problem.

Kolter and Maloof [18] presented a method for tackling concept drift based on the Weighted Majority algorithm. They called their method Dynamic Weighted Majority. It maintains a set of online learning algorithms called experts, each with an associated weight. When predicting, the algorithm combines the predictions of the experts and forms the final prediction as the class label with the highest accumulated weight. If an expert predicts incorrectly, its weight is multiplied by a constant factor β. If the global prediction is incorrect, the algorithm may create a new expert with weight 1. The algorithm also removes experts with weights lower than a predefined threshold.

In [19] Bifet et al. present a method of bagging using Adaptive-Size Hoeffding Trees, which have a maximum number of split nodes and can remove nodes to reduce their size. The method limits the size of the n-th tree to twice the size of the (n-1)-th tree and gives each tree a weight proportional to the inverse of its squared error, computed as an exponential moving average.
The main idea behind this method is that smaller trees will be able to adapt to concept drift faster than larger ones, while the latter will perform better when the distribution of the stream remains stationary.

Chernov and Vovk [20] introduce a framework of prediction with expert advice, in which they formulate the prediction problem as a game played by a learner and N experts. The goal of the learner is to perform better than, or at least not worse than, each expert. In order to compare the learner's loss with the experts', the learner's loss is computed separately against each expert (i.e. each expert evaluates its own performance and the performance of the learner according to its own criteria). They apply their framework to the defensive forecasting algorithm and to the framework of specialist experts, where an expert may abstain from making a prediction.

3.10 Algorithm Output Granularity

Algorithm Output Granularity (AOG) is an approach proposed by Gaber et al. [21]. It is the first resource-aware approach in which the system is able to adapt to time and memory constraints as well as to the data stream rate. The main steps of the algorithm are the following:

1. Compute the algorithm threshold according to the incoming data rate and available memory.
2. Mine the incoming data stream using the algorithm threshold.
3. After a period of time, re-compute the algorithm threshold using linear regression.

The adaptation is achieved through the algorithm threshold, which is first estimated and then periodically recomputed to cope with variations in the data rate. They propose three lightweight algorithms that follow this approach: one for clustering, one for classification and one for counting frequent items. We will discuss only the first two.

The proposed clustering technique is a one-pass technique called LWC and works as follows. The first instance is taken as the first centroid. As new instances arrive, the distance to the nearest centroid is computed. If this distance is greater than a threshold t (which is periodically adjusted), a new center is created. Otherwise the weight of the nearest centroid is increased by one. When the number of centers reaches k (k depends on the available memory) the algorithm starts updating the cluster vector. When memory becomes full the algorithm merges clusters.

The classification technique is an adaptation of the k-nearest neighbor method (KNN). Instead of storing all the instances in memory, the algorithm measures the distance from the new instance to the nearest one already stored in memory. If this distance is less than a threshold (which is periodically adjusted) and the classes are the same, the algorithm stores the average of the two with an increased weight. If the classes differ, the algorithm deletes both instances from memory. When classifying an instance, the number of nearest neighbors is also chosen according to the time constraints.
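The one-pass clustering technique (LWC) described above can be sketched roughly as follows. The sketch rests on several assumptions of ours: a new center is opened when the instance is farther than the threshold t from every existing centroid, the threshold is kept fixed instead of being periodically adjusted, and once k centers exist every instance is absorbed by its nearest centroid, which is moved to the weighted mean. The class name `LightweightClusterer` is hypothetical, and this is not the implementation of [21].

```python
import math


class LightweightClusterer:
    """Threshold-based one-pass clustering with weighted centroids."""

    def __init__(self, threshold, max_centers):
        self.threshold = threshold      # assumption: fixed, not re-estimated
        self.max_centers = max_centers  # k, bounded by the available memory
        self.centers = []               # entries are [centroid, weight]

    def update(self, x):
        if not self.centers:
            # The first instance becomes the first centroid.
            self.centers.append([list(x), 1])
            return
        nearest = min(self.centers, key=lambda c: math.dist(c[0], x))
        far_away = math.dist(nearest[0], x) > self.threshold
        if far_away and len(self.centers) < self.max_centers:
            self.centers.append([list(x), 1])
        else:
            # Absorb the instance: shift the centroid to the weighted mean.
            centroid, weight = nearest
            nearest[0] = [(c * weight + xi) / (weight + 1)
                          for c, xi in zip(centroid, x)]
            nearest[1] = weight + 1


clusterer = LightweightClusterer(threshold=1.0, max_centers=10)
for point in [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 0.1)]:
    clusterer.update(point)
weights = [weight for _, weight in clusterer.centers]
```

Only the centroids and their weights are kept in memory, so the footprint is bounded by k regardless of the stream length.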

4 Critical Judgment

Our review shows that many algorithms have been developed for mining data streams. These are usually traditional data mining algorithms modified for the constraints of the data stream model. They generally follow one of two approaches: one-pass approximation or sliding window. The methods that follow the one-pass approach normally approximate traditional data mining algorithms and typically suffer from lower accuracy. Furthermore, many of these methods do not address concept drift at all, and those that do usually suffer from higher time and space complexity. The methods that follow the sliding window approach, in contrast, usually produce an exact model, but only on the most recent data.

In the real world, data streams come from different sources, e.g. sensor observations, and are typically noisy, contain redundant features and potentially have missing values, so they must be preprocessed. When dealing with concept drift, adapting only the predictor may not be enough to maintain good accuracy, as the system may fail to detect changes in the raw data. The issue of adaptive preprocessing was raised in [22], which identified scenarios where adaptive preprocessing is beneficial, but much research is still needed.

Most of the methods presented in the previous section have only been evaluated on specific domains with only one type of data, such as text, images or sensor observations. A comparison of the methods on different kinds of data (variety) is much needed. Furthermore, a general framework which could cope with variety is needed.

Most of the methods found in the literature do not take into consideration the arrival rate of the incoming data stream. The user has to check whether the method can adapt faster than the instances arrive and has to manually retune the parameters to keep the method responsive. Algorithm Output Granularity offers a solution to this problem, but only basic algorithms that follow that methodology have been proposed.

With the vast amounts of data flowing into the system, overfitting becomes a big risk. While overfitting is mentioned in the literature, it is not addressed as an open issue that needs special consideration.

5 Further Work

Some ideas for future work include:

Context-aware mining: When predicting future events, contextual information may improve prediction accuracy substantially. For example, when predicting traffic congestion it is not enough to consider only data coming from the sensor network; information such as time of day, day of the week, weather forecast and information gained from social networks should also be taken into account. Most techniques presented in the previous sections do not consider context when building models and making predictions. As part of further work, context-aware techniques or frameworks may be investigated and developed.

Cost of updating vs. accuracy gain: Most of the algorithms proposed in the literature update their model continuously as instances arrive, stressing the underlying hardware and making it unavailable for other tasks. Future work may include investigating the tradeoff between the cost of updating the model and the accuracy gained by updating. Furthermore, a methodology which takes cost into consideration and only updates the model when high gains are expected may be developed.

Addressing variety: In real-world applications data streams come in several forms. Examples include sensor network observations, news articles and images. The proposed methods may be tested on a variety of different types of data sources, and a thorough comparison given. Furthermore, methods and frameworks that deal with various types of data may be developed.

Development of new methods: Many methods and techniques have been developed specifically for mining data streams. There is, however, still room for improvement. A thorough investigation of the current techniques can be performed and new one-pass algorithms, designed specifically for mining data streams, developed.

Overfitting: When dealing with large amounts of data, overfitting becomes a serious risk. A study of this phenomenon in data stream mining may be performed and the issue addressed thoroughly.

Change detection: When mining data streams the distribution of the data can change, potentially rendering models obsolete. Future work may include the development of change detection techniques and methods for change adaptation.

6 References

[1] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.
[2] F. Rusu and A. Dobra, Sketching sampled data streams, 2009 IEEE 25th Int. Conf. Data Eng., Mar.
[3] J. Gama and M. Gaber, Learning from Data Streams. Springer.
[4] P. Domingos and G. Hulten, A general method for scaling up machine learning algorithms and its application to clustering, ICML.
[5] P. Domingos and G. Hulten, Mining high-speed data streams, Proc. sixth ACM SIGKDD Int. Conf. Knowl. Discov. data Min. - KDD '00.
[6] G. Hulten, L. Spencer, and P. Domingos, Mining time-changing data streams, Discov. data Min., vol. 18, pp. 1-10.
[7] L. Rutkowski, M. Jaworski, L. Pietruczuk, and P. Duda, Decision Trees for Mining Data Streams Based on the Gaussian Approximation, IEEE Trans. Knowl. Data Eng., vol. 26, no. 1, Jan.
[8] E. Ikonomovska, J. Gama, and S. Džeroski, Learning model trees from evolving data streams, Data Min. Knowl. Discov., vol. 23, no. 1, Oct.
[9] M. Godec, C. Leistner, A. Saffari, and H. Bischof, On-Line Random Naive Bayes for Tracking, Int. Conf. Pattern Recognit., no. 2, Aug.
[10] G. Cauwenberghs and T. Poggio, Incremental and decremental support vector machine learning, Adv. neural Inf.
[11] C. P. Diehl and G. Cauwenberghs, SVM incremental learning, adaptation and optimization, Proc. Int. Jt. Conf. Neural Networks, 2003, vol. 4.
[12] J. Ma, J. Theiler, and S. Perkins, Accurate on-line support vector regression, Neural Comput., vol. 15, no. 11, Dec.
[13] E. Harrington, R. Herbrich, and J. Kivinen, Online Bayes point machines, Proc. Seventh Pacific-Asia Conf. Knowl. Discov. Data Min.
[14] S. Vijayakumar, A. D'Souza, and S. Schaal, Incremental online learning in high dimensions, Neural Comput., vol. 17, no. 12, Dec.
[15] R. M. McCutchen and S. Khuller, Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity, in Approximation, Randomization and Combinatorial Optimization, Springer, 2008.
[16] P. Rodrigues, Hierarchical clustering of time-series data streams, Knowl. Data, pp. 1-12.
[17] H. Park and D. Turaga, A framework for distributed multimedia stream mining systems using coalition-based foresighted strategies, Speech Signal.
[18] J. Z. Kolter and M. A. Maloof, Dynamic weighted majority: a new ensemble method for tracking concept drift, Third IEEE Int. Conf. Data Min.
[19] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà, New ensemble methods for evolving data streams, Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discov. data Min. - KDD '09, p. 139.
[20] A. Chernov and V. Vovk, Prediction with expert evaluators' advice, in Algorithmic Learning Theory, 2009.
[21] M. Gaber, Adaptive mining techniques for data streams using algorithm output granularity, Australas. Data Min.
[22] B. Gabrys, Adaptive Preprocessing for Streaming Data, IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, 2014.


More information

Naïve Bayes for text classification

Naïve Bayes for text classification Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support

More information

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records.

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records. DATA STREAMS MINING Mining Data Streams From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records. Hammad Haleem Xavier Plantaz APPLICATIONS Sensors

More information

Review on Gaussian Estimation Based Decision Trees for Data Streams Mining Miss. Poonam M Jagdale 1, Asst. Prof. Devendra P Gadekar 2

Review on Gaussian Estimation Based Decision Trees for Data Streams Mining Miss. Poonam M Jagdale 1, Asst. Prof. Devendra P Gadekar 2 Review on Gaussian Estimation Based Decision Trees for Data Streams Mining Miss. Poonam M Jagdale 1, Asst. Prof. Devendra P Gadekar 2 1,2 Pune University, Pune Abstract In recent year, mining data streams

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Jesse Read 1, Albert Bifet 2, Bernhard Pfahringer 2, Geoff Holmes 2 1 Department of Signal Theory and Communications Universidad

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Feature Based Data Stream Classification (FBDC) and Novel Class Detection

Feature Based Data Stream Classification (FBDC) and Novel Class Detection RESEARCH ARTICLE OPEN ACCESS Feature Based Data Stream Classification (FBDC) and Novel Class Detection Sminu N.R, Jemimah Simon 1 Currently pursuing M.E (Software Engineering) in Vins christian college

More information

EFFICIENT ADAPTIVE PREPROCESSING WITH DIMENSIONALITY REDUCTION FOR STREAMING DATA

EFFICIENT ADAPTIVE PREPROCESSING WITH DIMENSIONALITY REDUCTION FOR STREAMING DATA INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 EFFICIENT ADAPTIVE PREPROCESSING WITH DIMENSIONALITY REDUCTION FOR STREAMING DATA Saranya Vani.M 1, Dr. S. Uma 2,

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK REAL TIME DATA SEARCH OPTIMIZATION: AN OVERVIEW MS. DEEPASHRI S. KHAWASE 1, PROF.

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

CSE 573: Artificial Intelligence Autumn 2010

CSE 573: Artificial Intelligence Autumn 2010 CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Implementation: Real machine learning schemes Decision trees Classification

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification Extended R-Tree Indexing Structure for Ensemble Stream Data Classification P. Sravanthi M.Tech Student, Department of CSE KMM Institute of Technology and Sciences Tirupati, India J. S. Ananda Kumar Assistant

More information

Classification: Feature Vectors

Classification: Feature Vectors Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12

More information

Lecture #11: The Perceptron

Lecture #11: The Perceptron Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Machine Learning and Pervasive Computing

Machine Learning and Pervasive Computing Stephan Sigg Georg-August-University Goettingen, Computer Networks 17.12.2014 Overview and Structure 22.10.2014 Organisation 22.10.3014 Introduction (Def.: Machine learning, Supervised/Unsupervised, Examples)

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

Incremental Classification of Nonstationary Data Streams

Incremental Classification of Nonstationary Data Streams Incremental Classification of Nonstationary Data Streams Lior Cohen, Gil Avrahami, Mark Last Ben-Gurion University of the Negev Department of Information Systems Engineering Beer-Sheva 84105, Israel Email:{clior,gilav,mlast}@

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

Incremental Learning Algorithm for Dynamic Data Streams

Incremental Learning Algorithm for Dynamic Data Streams 338 IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.9, September 2008 Incremental Learning Algorithm for Dynamic Data Streams Venu Madhav Kuthadi, Professor,Vardhaman College

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, March 18, ISSN

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, March 18,   ISSN International Journal of Computer Engineering and Applications, Volume XII, Special Issue, March 18, www.ijcea.com ISSN 2321-3469 COMBINING GENETIC ALGORITHM WITH OTHER MACHINE LEARNING ALGORITHM FOR CHARACTER

More information

Data Mining. Lecture 03: Nearest Neighbor Learning

Data Mining. Lecture 03: Nearest Neighbor Learning Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F. Provost

More information

Optimization Methods for Machine Learning (OMML)

Optimization Methods for Machine Learning (OMML) Optimization Methods for Machine Learning (OMML) 2nd lecture Prof. L. Palagi References: 1. Bishop Pattern Recognition and Machine Learning, Springer, 2006 (Chap 1) 2. V. Cherlassky, F. Mulier - Learning

More information

Cluster based boosting for high dimensional data

Cluster based boosting for high dimensional data Cluster based boosting for high dimensional data Rutuja Shirbhate, Dr. S. D. Babar Abstract -Data Dimensionality is crucial for learning and prediction systems. Term Curse of High Dimensionality means

More information

Nesnelerin İnternetinde Veri Analizi

Nesnelerin İnternetinde Veri Analizi Nesnelerin İnternetinde Veri Analizi Bölüm 3. Classification in Data Streams w3.gazi.edu.tr/~suatozdemir Supervised vs. Unsupervised Learning (1) Supervised learning (classification) Supervision: The training

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Kernels + K-Means Matt Gormley Lecture 29 April 25, 2018 1 Reminders Homework 8:

More information

Hierarchical Clustering 4/5/17

Hierarchical Clustering 4/5/17 Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction

More information

Ensemble Learning. Another approach is to leverage the algorithms we have via ensemble methods

Ensemble Learning. Another approach is to leverage the algorithms we have via ensemble methods Ensemble Learning Ensemble Learning So far we have seen learning algorithms that take a training set and output a classifier What if we want more accuracy than current algorithms afford? Develop new learning

More information

CISC 4631 Data Mining

CISC 4631 Data Mining CISC 4631 Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F.

More information

Optimizing the Computation of the Fourier Spectrum from Decision Trees

Optimizing the Computation of the Fourier Spectrum from Decision Trees Optimizing the Computation of the Fourier Spectrum from Decision Trees Johan Jonathan Nugraha A thesis submitted to Auckland University of Technology in partial fulfillment of the requirement for the degree

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines & Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector

More information

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron CS 188: Artificial Intelligence Spring 2010 Lecture 24: Perceptrons and More! 4/20/2010 Announcements W7 due Thursday [that s your last written for the semester!] Project 5 out Thursday Contest running

More information

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer

More information

1 INTRODUCTION 2 RELATED WORK. Usha.B.P ¹, Sushmitha.J², Dr Prashanth C M³

1 INTRODUCTION 2 RELATED WORK. Usha.B.P ¹, Sushmitha.J², Dr Prashanth C M³ International Journal of Scientific & Engineering Research, Volume 7, Issue 5, May-2016 45 Classification of Big Data Stream usingensemble Classifier Usha.B.P ¹, Sushmitha.J², Dr Prashanth C M³ Abstract-

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview 1. Overview of SVMs 2. Margin Geometry 3. SVM Optimization 4. Overlapping Distributions 5. Relationship to Logistic Regression 6. Dealing

More information

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Engineering the input and output Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Attribute selection z Scheme-independent, scheme-specific

More information

Random Forest A. Fornaser

Random Forest A. Fornaser Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University

More information

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs) Data Mining: Concepts and Techniques Chapter 9 Classification: Support Vector Machines 1 Support Vector Machines (SVMs) SVMs are a set of related supervised learning methods used for classification Based

More information

More Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA

More Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA More Learning Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA 1 Ensembles An ensemble is a set of classifiers whose combined results give the final decision. test feature vector

More information

Challenges in Ubiquitous Data Mining

Challenges in Ubiquitous Data Mining LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 2 Very-short-term Forecasting in Photovoltaic Systems 3 4 Problem Formulation: Network Data Model Querying Model Query = Q( n i=0 S i)

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Novel Class Detection Using RBF SVM Kernel from Feature Evolving Data Streams

Novel Class Detection Using RBF SVM Kernel from Feature Evolving Data Streams International Refereed Journal of Engineering and Science (IRJES) ISSN (Online) 2319-183X, (Print) 2319-1821 Volume 4, Issue 2 (February 2015), PP.01-07 Novel Class Detection Using RBF SVM Kernel from

More information

Support Vector Machines

Support Vector Machines Support Vector Machines . Importance of SVM SVM is a discriminative method that brings together:. computational learning theory. previously known methods in linear discriminant functions 3. optimization

More information

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday. CS 188: Artificial Intelligence Spring 2011 Lecture 21: Perceptrons 4/13/2010 Announcements Project 4: due Friday. Final Contest: up and running! Project 5 out! Pieter Abbeel UC Berkeley Many slides adapted

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

On Biased Reservoir Sampling in the Presence of Stream Evolution

On Biased Reservoir Sampling in the Presence of Stream Evolution Charu C. Aggarwal T J Watson Research Center IBM Corporation Hawthorne, NY USA On Biased Reservoir Sampling in the Presence of Stream Evolution VLDB Conference, Seoul, South Korea, 2006 Synopsis Construction

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

More information

Data Stream Clustering Using Micro Clusters

Data Stream Clustering Using Micro Clusters Data Stream Clustering Using Micro Clusters Ms. Jyoti.S.Pawar 1, Prof. N. M.Shahane. 2 1 PG student, Department of Computer Engineering K. K. W. I. E. E. R., Nashik Maharashtra, India 2 Assistant Professor

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Data Stream Mining. Tore Risch Dept. of information technology Uppsala University Sweden

Data Stream Mining. Tore Risch Dept. of information technology Uppsala University Sweden Data Stream Mining Tore Risch Dept. of information technology Uppsala University Sweden 2016-02-25 Enormous data growth Read landmark article in Economist 2010-02-27: http://www.economist.com/node/15557443/

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

A Two-level Learning Method for Generalized Multi-instance Problems

A Two-level Learning Method for Generalized Multi-instance Problems A wo-level Learning Method for Generalized Multi-instance Problems Nils Weidmann 1,2, Eibe Frank 2, and Bernhard Pfahringer 2 1 Department of Computer Science University of Freiburg Freiburg, Germany weidmann@informatik.uni-freiburg.de

More information

Classification of Concept Drifting Data Streams Using Adaptive Novel-Class Detection

Classification of Concept Drifting Data Streams Using Adaptive Novel-Class Detection Volume 3, Issue 9, September-2016, pp. 514-520 ISSN (O): 2349-7084 International Journal of Computer Engineering In Research Trends Available online at: www.ijcert.org Classification of Concept Drifting

More information

Machine Learning. Chao Lan

Machine Learning. Chao Lan Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Ms. Ritu Dr. Bhawna Suri Dr. P. S. Kulkarni (Assistant Prof.) (Associate Prof. ) (Assistant Prof.) BPIT, Delhi BPIT, Delhi COER, Roorkee

Ms. Ritu Dr. Bhawna Suri Dr. P. S. Kulkarni (Assistant Prof.) (Associate Prof. ) (Assistant Prof.) BPIT, Delhi BPIT, Delhi COER, Roorkee Journal Homepage: NOVEL FRAMEWORK FOR DATA STREAMS CLASSIFICATION APPROACH BY DETECTING RECURRING FEATURE CHANGE IN FEATURE EVOLUTION AND FEATURE S CONTRIBUTION IN CONCEPT DRIFT Ms. Ritu Dr. Bhawna Suri

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information









