Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Size: px
Start display at page:

Download "Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques"

Transcription

1 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE University of Economic Studies, Bucharest, Romania During recent years, the amounts of data, collected and stored by organizations on a daily basis, have been growing constantly. These large volumes of data need to be analyzed, so organizations need innovative new solutions for extracting the significant information from these data. Such solutions are provided by data mining techniques, which apply advanced data analysis methods for discovering meaningful patterns within the raw data. In order to apply these techniques, such as Naïve-Bayes classifier, data needs to be preprocessed and transformed, to increase the accuracy and efficiency of the algorithms and obtain the best results. This paper focuses on performing a comparative analysis of the forecasting performance obtained with the Naïve-Bayes classifier on a dataset, by applying different data discretization methods opposed to running the algorithms on the initial dataset. Keywords: Discretization, Naïve-Bayes classifier, Data mining, Performance 1 Introduction Nowadays, organizations collect large amounts of data every day. These data need to be analyzed in order to find the meaningful information contained by it and reach the best conclusions, to support decision making. Data mining is an innovative new solution, which provides the required tools for processing the data in order to extract significant patterns and trends. Before running data mining algorithms against it, the raw data needs to be cleaned and transformed. This is accomplished through the preliminary steps of the Knowledge Discovery in Databases process data preprocessing and data transformation. One of the key methods used during data transformation is data discretization. Discretization methods transform the continuous values of a dataset attribute to discrete ones. It can help improve significantly the forecasting performance of classification algorithms, like Naïve Bayes, that are sensitive to the dimensionality of the data. Naïve-Bayes is an intuitive data mining algorithm that predicts class membership, using the probabilities of each attribute value to belong to each class. Discretization methods needs to be applied on datasets before analyzing them, in order to transform the continuous variables to discrete variables and, thus, to improve the accuracy and efficiency of the classification algorithm. 2. Naïve-Bayes classifiers: overview Classification is a fundamental issue in machine learning and statistics. It is a supervised data mining technique, with the goal of accurately predicting the class label for each item in a given dataset. A classification model built to predict class labels, from the attributes of the dataset, is known as a classifier. In data mining, Bayesian classifiers are a family of probabilistic classifiers, based on applying Bayes' theorem. The theorem, named after Reverend Thomas Bayes ( ), who has greatly contributed to the field probability and statistics, is a mathematical formula used for calculating conditional probabilities. It relates current probability to prior probability. Bayesian classifiers can predict class membership probabilities such as the probability that a given tuple belongs to a particular class. Studies comparing

2 Database Systems Journal vol. VI, no. 2/ classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases. [1] The Naïve-Bayes classifier is an intuitive data mining method that uses the probabilities of each attribute value belonging to each class to predict class membership. A Bayesian classifier is stated mathematically as the following equation: [1] P(X C i)p(c i) P(C i X) = (1) P(X) where, P(C i X) is the probability of dataset item X belonging to class C i ; P(X C i ) is the probability of generating dataset item X given class C i ; P(C i ) is the probability of occurrence of class C i ; P(X) is the probability of occurrence of dataset item X. Naïve-Bayes classifiers simplify the computation of probabilities by assuming that the probability of each attribute value to belong to a given class label is independent of all the other attribute values. This method goes by the name of Naïve Bayes because it s based on Bayes rule and naïvely assumes independence it is only valid to multiply probabilities when the events are independent. The assumption that attributes are independent (given the class) in real life certainly is a simplistic one. But despite the disparaging name, Naïve Bayes works very effectively when tested on actual datasets, particularly when combined with some of the attribute selection procedures. [2] 3 Discretization techniques: theoretical framework The Knowledge Discovery in Databases (KDD) process is an iterative process for identifying valid, new and significant patterns in large and complex datasets. The core step of the KDD process is data mining, which involves developing the model for discovering patterns and trends in the data. Data preprocessing and data transformation are crucial steps of the KDD process. After performing them better data should be generated, in a form suitable for the data mining algorithms. Data transformation methods include dimension reduction (such as feature selection and extraction, and record sampling), and attribute transformation (such as discretization of numerical attributes and functional transformation). [3] Most experimental datasets have attributes with continuous values. However, data mining techniques often need that the attributes describing the data are discrete, so the discretization of the continuous attributes before applying the algorithms is important for producing better data mining models. The goal of discretization is to reduce the number of values a continuous attribute assumes by grouping them into a number, n, of intervals (bins). [4] Mainly there are two tasks of discretization. The first task is to find the number of discrete intervals. Only a few discretization algorithms perform this; often, the user must specify the number of intervals or provide a heuristic rule. The second task is to find the width, or the boundaries, of the intervals given the range of values of a continuous attribute. [5] Data discretization comprises a large variety of methods. They can be classified based on how the discretization is performed into: supervised vs. unsupervised, global vs. local, static vs. dynamic, parametric vs. nonparametric, hierarchical vs. non-hierarchical etc. There are several methods that can be used for data discretization. Supervised discretization methods can be divided into

3 26 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques error-based, entropy-based or statistics based. Among the unsupervised discretization methods there are the ones like equal-width and equal-frequency. Equal-width discretization This method consists of sorting the values of the dataset and dividing them into intervals (bins) of equal range. The user specifies k, the number of intervals to be calculated, then the algorithm determined the minimum and maximum values and divides the dataset into k intervals. Equal-width interval discretization is a simplest discretization method that divides the range of observed values for a feature into k equal sized bins, where k is a parameter provided by the user. The process involves sorting the observed values of a continuous feature and finding the minimum, V min and maximum, V max, values. [5] The method divides the dataset into k intervals of equal size. The width of each interval is calculated using the following formula: V max V min Width = (2) k The boundaries of the intervals are calculated as: V min, V min +Width, V min +2Width,..., V min +(k-1)width, V max. The limitations of this method are given by the uneven distribution of the data points: some intervals may contain much more data points than other. [5] Equal-frequency discretization This method is based on dividing the dataset into intervals containing the same number of items. Partitioning of data is based on allocating the same number of instances to each bin. The user supplies k, the number of intervals to be calculated, then the algorithm divides n, the total number of items belonging to the dataset, by k. Equal-Frequency Discretization predefines k, the number of intervals. It then divides the sorted values into k intervals so that each interval contains approximately the same number of training instances. Suppose there are n training instances, each interval then contains n/k training instances with adjacent (possibly identical) values. [3] The method divides the dataset into k intervals with equal number of instances. The intervals can be computed using the following formula: n Interval = (3) k Equal-frequency binning can yield excellent results, at least in conjunction with the Naïve Bayes learning scheme, when the number of bins is chosen in a data-dependent fashion by setting it to the square root of the number of instances. [2] Entropy-based discretization One of the supervised discretization methods, introduced by Fayyad and Irani, is called the entropy-based discretization. An entropy-based method will use the class information entropy of candidate partitions to select boundaries for discretization. [5] The method calculates the entropy based on the class labels and finds the best split-points, so that most of the values in an interval fit the same class label the split-points with the maximal information gain. The entropy function for a given set S is calculated using the formula: [5] Info(S) = - pi log 2 (pi) (4) Based on this entropy measure, the discretization algorithm can find potential split-points within the existing range of continuous values. The split-point with the lowest entropy is chosen to split the range into two intervals, and the binary split is continued with each part until a stopping criterion is satisfied. [5] Many other discretization methods may be applied on raw data, both supervised and unsupervised. Among the supervised methods we can mention Chi-Square based discretization, while a sophisticated

4 Database Systems Journal vol. VI, no. 2/ unsupervised method is k-means discretization, based on clustering analysis. Discretization techniques are generally considered to improve the forecasting performance of data mining techniques, particularly classification algorithms like Naïve-Bayes classifier, and, at the same time, it is thought that, choosing one discretization algorithm over another, influences the significance of the forecasting improvement. 4. Case study: Evaluating the performance of Naïve-Bayes classifiers on discretized datasets This case study focuses on presenting the experimental results obtained by forecasting the class label for the Credit Approval dataset, using the Naïve-Bayes classifier. The algorithm was applied to the original data, as well as to each transformed dataset, obtained by using each of the discretization methods described in this paper. The dataset used for the experimental study concerns credit card applications and it was obtained from UCI Machine Learning Repository [6]. The Credit Approval dataset comprises 690 instances, characterized by 15 attributes and a class attribute. The values of the class attribute in the dataset can be + (positive) or - (negative) and they indicate the credit card application status for each submitted application. This experimental study was performed using RapidMiner Software [7]. RapidMiner is a software platform, developed by the company of the same name, which provides support for all steps of the data mining process. RapidMiner Software supports data discretization through its discretization operators. Five discretization methods are provided by RapidMiner: Discretize by Binning, Discretize by Frequency, Discretize by Size, Discretize by Entropy and Discretize by User Specification. Among these, three methods were used during the case study, corresponding to the ones described in the paper: [7] Discretize by Binning this operator discretizes the selected numerical attributes into user-specified number of bins. Bins of equal range are automatically generated, the number of the values in different bins may vary; Discretize by Frequency this operator converts the selected numerical attributes into nominal attributes by discretizing the numerical attribute into a user-specified number of bins. Bins of equal frequency are automatically generated, the range of different bins may vary; Discretize by Entropy this operator converts the selected numerical attributes into nominal attributes. The boundaries of the bins are chosen so that the entropy is minimized in the induced partitions. During the experiment I defined a data mining process in RapidMiner. This process applied Naïve-Bayes classifier on the Credit Approval dataset, first on the original data and afterwards on the datasets obtained by applying each discretization method. I obtained performance indicators for each of these cases and I performed a comparative analysis on the results achieved without discretization and the results achieved with each discretization method. The process flow defined in RapidMiner, for applying the Naïve-Bayes classifier, is presented in figure 1:

5 28 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Fig. 1. Naïve-Bayes process flow in RapidMiner The process defined for evaluating the performance of Naïve-Bayes classifier includes the following RapidMiner operators: Read ARFF this operator is used for reading an ARFF file, in our case the file credit-a.arff; Set Role this operator is used to change the role of one or more attributes; Discretize this operator discretizes the selected numerical attributes to nominal attributes. RapidMiner supports applying all three discretization methods described in this paper: for equal-width discretization we can use Discretize by Binning, for equal-frequency discretization we can use Discretize by Frequency and for entropy-based discretization we can use Discretize by Entropy. Naïve-Bayes this operator generates a Naive Bayes classification model. A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be 'independent feature model. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class (i.e. attribute) is unrelated to the presence (or absence) of any other feature [7]; Apply Model this operator applies an already learnt or trained model, in this case Naïve-Bayes, on a dataset, for prediction purposes; Performance this operator is used for performance evaluation, through a number of statistical indicators. For my case study I chose to use accuracy, kappa and root mean squared error (RMSE). The process described above was executed first without the Discretize operator, on the original Credit Approval dataset, and afterwards including the Discretize operator, for each discretization method. By executing the results, in each case, the process generated results for evaluating the performance of the Naïve-Bayes classifier on the dataset. Discretization methods should increase the efficiency of Naïve- Bayes classifiers. For performance evaluation I compared the results obtained in terms of accuracy, kappa and root mean square error. In order to establish which discretization technique was better for transforming the dataset, the accuracy of the classification should be as higher, as well as kappa, while the root mean square error should be as low as possible. A comparative analysis of the performance achieved, for the Naïve-Bayes classifier, with each discretization method, is shown in table 1: Table 1. Comparative analysis of the Naïve-Bayes classifier performance with discretization methods Performance indicator Discretization method Accuracy Kappa Root mean squared error None 78.26% Equal-width discretization 87.83% Equal-frequency discretization 84.06% Entropy-based discretization 86.96%

6 Database Systems Journal vol. VI, no. 2/ Based on the results in the table above, it is obvious that applying discretization methods to the dataset, before running the Naïve-Bayes classifier, has significantly improved the performance of the classification algorithm and increased the accuracy of the results obtained. Among the discretization methods applied as part of the experimental study, equal-width discretization produces the lowest root mean squared error, 0.314, compared to the other two methods. Equal-frequency discretization generates the highest root mean squared error, of The prediction accuracy of the Naïve- Bayes classifier has the highest value when applying equal-width discretization %, while equal-frequency discretization has an accuracy of 84.06%. Performance indicator kappa compares the observed accuracy with the expected accuracy of the classifier. Thus, the best discretization method is the one generating the highest kappa. The discretization method producing the highest kappa, of 0.754, is equal-width discretization. Applying equal-frequency discretization generates the lowest kappa, My experiment compares the quality of the classification achieved through the Naïve-Bayes classifier, on the Credit Approval dataset, without performing discretization on the raw data and after applying discretization methods on the data. Through this experiment I am also comparing the performance obtained by applying each of the discretization methods, against each other. Based on the previous statements, all three discretization methods improve the quality of the classification, compared to running the classification on the initial dataset. Among the methods, the best quality classification is obtained by applying equal-width discretization, opposed to equal-frequency discretization, which generates the lowest quality classification. 5. Conclusions Discretization is a very important transformation step for data mining algorithms that can only handle discrete data. The results of my tests confirm that the performance of Naïve-Bayes classifier is improved when discretization methods are applied on the dataset used in the analysis. In this paper, I have studied the effect that applying different discretization methods, can have on the results obtained by performing a classification analysis, with the Naïve-Bayes classifier. The conclusion, based on the results obtained, is that applying the discretization methods prior to running the classification algorithm is beneficial for the analysis, since better performance indicators have been obtained on the discrete data. Based on the experimental results, I can assert that Naïve-Bayes classifier generates better results on discrete datasets and, also, that for the particular dataset we used, the most efficient discretization method was equal-width discretization, while equalfrequency discretization was the least efficient for improving the classification efficiency and accuracy of the Naïve-Bayes classifier. References [1]Jiawei Han, Micheline Kamber and Jian Pei Data Mining: Concepts and Techniques. Third Edition, Morgan Kaufmann Publishers, USA, 2011, ISBN [2]Ian H. Witten, Eibe Frank and Mark A. Hall - Data Mining: Practical Machine Learning Tools and Techniques. Third Edition, Morgan Kaufmann Publishers, USA, 2011, ISBN [3]Oded Maimon and Lior Rokach Data Mining and Knowledge Discovery Handbook. Second Edition, Springer Publisher, London, 2010, ISBN [4]Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski and Lukasz A. Kurgan Data Mining. A Knowledge Discovery Approach, Springer

7 30 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Publisher, USA, 2007, ISBN [5]Rajashree Dash, Rajib Lochan Paramguru and Rasmita Dash Comparative Analysis of Supervised and Unsupervised Discretization Techniques, International Journal of Advances in Science and Technology, Vol. 2, No. 3, 2011, ISSN [6]UCI Machine Learning Repository Credit Approval Data Set, Available online (May 16th, 2015): edit+approval. [7]RapidMiner Software, Available online (May 14th, 2015): Ruxandra PETRE graduated from the Faculty of Cybernetics, Statistics and Economic Informatics of the Bucharest University of Economic Studies in In 2012 she graduated the Business Support Databases Master program. Currently, she is a PhD candidate, coordinated by Professor Ion LUNGU in the field of Economic Informatics at the Bucharest University of Economic Studies. Her scientific fields of interest include: Databases, Data Warehouses, Business Intelligence, Decision Support Systems and Data Mining.

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of real-valued data is often used as a pre-processing

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES USING DIFFERENT DATASETS V. Vaithiyanathan 1, K. Rajeswari 2, Kapil Tajane 3, Rahul Pitale 3 1 Associate Dean Research, CTS Chair Professor, SASTRA University,

More information

Discretizing Continuous Attributes Using Information Theory

Discretizing Continuous Attributes Using Information Theory Discretizing Continuous Attributes Using Information Theory Chang-Hwan Lee Department of Information and Communications, DongGuk University, Seoul, Korea 100-715 chlee@dgu.ac.kr Abstract. Many classification

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Iteration Reduction K Means Clustering Algorithm

Iteration Reduction K Means Clustering Algorithm Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

CLASSIFICATION OF C4.5 AND CART ALGORITHMS USING DECISION TREE METHOD

CLASSIFICATION OF C4.5 AND CART ALGORITHMS USING DECISION TREE METHOD CLASSIFICATION OF C4.5 AND CART ALGORITHMS USING DECISION TREE METHOD Khin Lay Myint 1, Aye Aye Cho 2, Aye Mon Win 3 1 Lecturer, Faculty of Information Science, University of Computer Studies, Hinthada,

More information

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, 2014 ISSN 2278 5485 EISSN 2278 5477 discovery Science Comparative Study of Classification Algorithms Using Data Mining Akhila

More information

A study of classification algorithms using Rapidminer

A study of classification algorithms using Rapidminer Volume 119 No. 12 2018, 15977-15988 ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu A study of classification algorithms using Rapidminer Dr.J.Arunadevi 1, S.Ramya 2, M.Ramesh Raja

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

A Comparative Study of Selected Classification Algorithms of Data Mining

A Comparative Study of Selected Classification Algorithms of Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

SSV Criterion Based Discretization for Naive Bayes Classifiers

SSV Criterion Based Discretization for Naive Bayes Classifiers SSV Criterion Based Discretization for Naive Bayes Classifiers Krzysztof Grąbczewski kgrabcze@phys.uni.torun.pl Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń,

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

DISCRETIZATION BASED ON CLUSTERING METHODS. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

DISCRETIZATION BASED ON CLUSTERING METHODS. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania DISCRETIZATION BASED ON CLUSTERING METHODS Daniela Joiţa Titu Maiorescu University, Bucharest, Romania daniela.oita@utm.ro Abstract. Many data mining algorithms require as a pre-processing step the discretization

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Clustering of Data with Mixed Attributes based on Unified Similarity Metric Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1

More information

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44 Data Mining Piotr Paszek piotr.paszek@us.edu.pl Introduction (Piotr Paszek) Data Mining DM KDD 1 / 44 Plan of the lecture 1 Data Mining (DM) 2 Knowledge Discovery in Databases (KDD) 3 CRISP-DM 4 DM software

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

CloNI: clustering of JN -interval discretization

CloNI: clustering of JN -interval discretization CloNI: clustering of JN -interval discretization C. Ratanamahatana Department of Computer Science, University of California, Riverside, USA Abstract It is known that the naive Bayesian classifier typically

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Efficient SQL-Querying Method for Data Mining in Large Data Bases

Efficient SQL-Querying Method for Data Mining in Large Data Bases Efficient SQL-Querying Method for Data Mining in Large Data Bases Nguyen Hung Son Institute of Mathematics Warsaw University Banacha 2, 02095, Warsaw, Poland Abstract Data mining can be understood as a

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Richa Jain 1, Namrata Sharma 2 1M.Tech Scholar, Department of CSE, Sushila Devi Bansal College of Engineering, Indore (M.P.),

More information

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique Research Paper Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique C. Sudarsana Reddy 1 S. Aquter Babu 2 Dr. V. Vasu 3 Department

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

Improved Discretization Based Decision Tree for Continuous Attributes

Improved Discretization Based Decision Tree for Continuous Attributes Improved Discretization Based Decision Tree for Continuous Attributes S.Jyothsna Gudlavalleru Engineering College, Gudlavalleru. G.Bharthi Asst. Professor Gudlavalleru Engineering College, Gudlavalleru.

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

Cluster based boosting for high dimensional data

Cluster based boosting for high dimensional data Cluster based boosting for high dimensional data Rutuja Shirbhate, Dr. S. D. Babar Abstract -Data Dimensionality is crucial for learning and prediction systems. Term Curse of High Dimensionality means

More information

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya Volume 5, Issue 1, January 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Performance

More information

Index Terms Data Mining, Classification, Rapid Miner. Fig.1. RapidMiner User Interface

Index Terms Data Mining, Classification, Rapid Miner. Fig.1. RapidMiner User Interface A Comparative Study of Classification Methods in Data Mining using RapidMiner Studio Vishnu Kumar Goyal Dept. of Computer Engineering Govt. R.C. Khaitan Polytechnic College, Jaipur, India vishnugoyal_jaipur@yahoo.co.in

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Introduction to Data Mining and Data Analytics

Introduction to Data Mining and Data Analytics 1/28/2016 MIST.7060 Data Analytics 1 Introduction to Data Mining and Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns

More information

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM 4.1 Introduction Nowadays money investment in stock market gains major attention because of its dynamic nature. So the

More information

Analyzing Outlier Detection Techniques with Hybrid Method

Analyzing Outlier Detection Techniques with Hybrid Method Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

More information

Application of Data Mining in Manufacturing Industry

Application of Data Mining in Manufacturing Industry International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 3, Number 2 (2011), pp. 59-64 International Research Publication House http://www.irphouse.com Application of Data Mining

More information

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors Dejan Sarka Anomaly Detection Sponsors About me SQL Server MVP (17 years) and MCT (20 years) 25 years working with SQL Server Authoring 16 th book Authoring many courses, articles Agenda Introduction Simple

More information

IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online):

IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online): IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online): 2321-0613 A Study on Handling Missing Values and Noisy Data using WEKA Tool R. Vinodhini 1 A. Rajalakshmi

More information

Data Mining: An experimental approach with WEKA on UCI Dataset

Data Mining: An experimental approach with WEKA on UCI Dataset Data Mining: An experimental approach with WEKA on UCI Dataset Ajay Kumar Dept. of computer science Shivaji College University of Delhi, India Indranath Chatterjee Dept. of computer science Faculty of

More information

A Critical Study of Selected Classification Algorithms for Liver Disease Diagnosis

A Critical Study of Selected Classification Algorithms for Liver Disease Diagnosis A Critical Study of Selected Classification s for Liver Disease Diagnosis Shapla Rani Ghosh 1, Sajjad Waheed (PhD) 2 1 MSc student (ICT), 2 Associate Professor (ICT) 1,2 Department of Information and Communication

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Unsupervised Discretization using Tree-based Density Estimation

Unsupervised Discretization using Tree-based Density Estimation Unsupervised Discretization using Tree-based Density Estimation Gabi Schmidberger and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {gabi, eibe}@cs.waikato.ac.nz

More information

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset M.Hamsathvani 1, D.Rajeswari 2 M.E, R.Kalaiselvi 3 1 PG Scholar(M.E), Angel College of Engineering and Technology, Tiruppur,

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 1 (2016), pp. 1131-1140 Research India Publications http://www.ripublication.com A Monotonic Sequence and Subsequence Approach

More information

ECLT 5810 Evaluation of Classification Quality

ECLT 5810 Evaluation of Classification Quality ECLT 5810 Evaluation of Classification Quality Reference: Data Mining Practical Machine Learning Tools and Techniques, by I. Witten, E. Frank, and M. Hall, Morgan Kaufmann Testing and Error Error rate:

More information

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Lutfi Fanani 1 and Nurizal Dwi Priandani 2 1 Department of Computer Science, Brawijaya University, Malang, Indonesia. 2 Department

More information

Improving Imputation Accuracy in Ordinal Data Using Classification

Improving Imputation Accuracy in Ordinal Data Using Classification Improving Imputation Accuracy in Ordinal Data Using Classification Shafiq Alam 1, Gillian Dobbie, and XiaoBin Sun 1 Faculty of Business and IT, Whitireia Community Polytechnic, Auckland, New Zealand shafiq.alam@whitireia.ac.nz

More information

Count based K-Means Clustering Algorithm

Count based K-Means Clustering Algorithm International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Count

More information

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof. Ruiz Problem

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Data Mining in Bioinformatics Day 1: Classification

Data Mining in Bioinformatics Day 1: Classification Data Mining in Bioinformatics Day 1: Classification Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls

More information

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá INTRODUCTION TO DATA MINING Daniel Rodríguez, University of Alcalá Outline Knowledge Discovery in Datasets Model Representation Types of models Supervised Unsupervised Evaluation (Acknowledgement: Jesús

More information

PARALLEL CLASSIFICATION ALGORITHMS

PARALLEL CLASSIFICATION ALGORITHMS PARALLEL CLASSIFICATION ALGORITHMS By: Faiz Quraishi Riti Sharma 9 th May, 2013 OVERVIEW Introduction Types of Classification Linear Classification Support Vector Machines Parallel SVM Approach Decision

More information

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Abstract Mrs. C. Poongodi 1, Ms. R. Kalaivani 2 1 PG Student, 2 Assistant Professor, Department of

More information

A SURVEY ON DATA MINING TECHNIQUES FOR CLASSIFICATION OF IMAGES

A SURVEY ON DATA MINING TECHNIQUES FOR CLASSIFICATION OF IMAGES A SURVEY ON DATA MINING TECHNIQUES FOR CLASSIFICATION OF IMAGES 1 Preeti lata sahu, 2 Ms.Aradhana Singh, 3 Mr.K.L.Sinha 1 M.Tech Scholar, 2 Assistant Professor, 3 Sr. Assistant Professor, Department of

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

List of Exercises: Data Mining 1 December 12th, 2015

List of Exercises: Data Mining 1 December 12th, 2015 List of Exercises: Data Mining 1 December 12th, 2015 1. We trained a model on a two-class balanced dataset using five-fold cross validation. One person calculated the performance of the classifier by measuring

More information

Mr. Avinash R. Pingale, Ms. Aparna A. Junnarkar Department of Comp. Engg., Savitribai Phule UoP, Pune, Maharashtra, India

Mr. Avinash R. Pingale, Ms. Aparna A. Junnarkar Department of Comp. Engg., Savitribai Phule UoP, Pune, Maharashtra, India Volume 4, Issue 12, December 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com RaajHans:

More information

AMOL MUKUND LONDHE, DR.CHELPA LINGAM

AMOL MUKUND LONDHE, DR.CHELPA LINGAM International Journal of Advances in Applied Science and Engineering (IJAEAS) ISSN (P): 2348-1811; ISSN (E): 2348-182X Vol. 2, Issue 4, Dec 2015, 53-58 IIST COMPARATIVE ANALYSIS OF ANN WITH TRADITIONAL

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners Data Mining 3.5 (Instance-Based Learners) Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction k-nearest-neighbor Classifiers References Introduction Introduction Lazy vs. eager learning Eager

More information

Basic Data Mining Technique

Basic Data Mining Technique Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

Part I. Instructor: Wei Ding

Part I. Instructor: Wei Ding Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set

More information

Comparative Study of Clustering Algorithms using R

Comparative Study of Clustering Algorithms using R Comparative Study of Clustering Algorithms using R Debayan Das 1 and D. Peter Augustine 2 1 ( M.Sc Computer Science Student, Christ University, Bangalore, India) 2 (Associate Professor, Department of Computer

More information

Using Google s PageRank Algorithm to Identify Important Attributes of Genes

Using Google s PageRank Algorithm to Identify Important Attributes of Genes Using Google s PageRank Algorithm to Identify Important Attributes of Genes Golam Morshed Osmani Ph.D. Student in Software Engineering Dept. of Computer Science North Dakota State Univesity Fargo, ND 58105

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

CSE 626: Data mining. Instructor: Sargur N. Srihari. Phone: , ext. 113

CSE 626: Data mining. Instructor: Sargur N. Srihari.   Phone: , ext. 113 CSE 626: Data mining Instructor: Sargur N. Srihari E-mail: srihari@cedar.buffalo.edu Phone: 645-6164, ext. 113 1 What is Data Mining? Different perspectives: CSE, Business, IT As a field of research in

More information

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran

More information

Enhancing K-means Clustering Algorithm with Improved Initial Center

Enhancing K-means Clustering Algorithm with Improved Initial Center Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of

More information

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai-600106,

More information

CSE4334/5334 DATA MINING

CSE4334/5334 DATA MINING CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy

More information

An Empirical Study on feature selection for Data Classification

An Empirical Study on feature selection for Data Classification An Empirical Study on feature selection for Data Classification S.Rajarajeswari 1, K.Somasundaram 2 Department of Computer Science, M.S.Ramaiah Institute of Technology, Bangalore, India 1 Department of

More information

Discretization and Grouping: Preprocessing Steps for Data Mining

Discretization and Grouping: Preprocessing Steps for Data Mining Discretization and Grouping: Preprocessing Steps for Data Mining Petr Berka 1 and Ivan Bruha 2 z Laboratory of Intelligent Systems Prague University of Economic W. Churchill Sq. 4, Prague CZ-13067, Czech

More information

Keywords- Classification algorithm, Hypertensive, K Nearest Neighbor, Naive Bayesian, Data normalization

Keywords- Classification algorithm, Hypertensive, K Nearest Neighbor, Naive Bayesian, Data normalization GLOBAL JOURNAL OF ENGINEERING SCIENCE AND RESEARCHES APPLICATION OF CLASSIFICATION TECHNIQUES TO DETECT HYPERTENSIVE HEART DISEASE Tulasimala B. N* 1, Elakkiya S 2 & Keerthana N 3 *1 Assistant Professor,

More information

Features: representation, normalization, selection. Chapter e-9

Features: representation, normalization, selection. Chapter e-9 Features: representation, normalization, selection Chapter e-9 1 Features Distinguish between instances (e.g. an image that you need to classify), and the features you create for an instance. Features

More information

Conceptual Review of clustering techniques in data mining field

Conceptual Review of clustering techniques in data mining field Conceptual Review of clustering techniques in data mining field Divya Shree ABSTRACT The marvelous amount of data produced nowadays in various application domains such as molecular biology or geography

More information

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER N. Suresh Kumar, Dr. M. Thangamani 1 Assistant Professor, Sri Ramakrishna Engineering College, Coimbatore, India 2 Assistant

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

An Efficient Clustering for Crime Analysis

An Efficient Clustering for Crime Analysis An Efficient Clustering for Crime Analysis Malarvizhi S 1, Siddique Ibrahim 2 1 UG Scholar, Department of Computer Science and Engineering, Kumaraguru College Of Technology, Coimbatore, Tamilnadu, India

More information

International Journal of Computer Engineering and Applications, Volume XII, Issue II, Feb. 18, ISSN

International Journal of Computer Engineering and Applications, Volume XII, Issue II, Feb. 18,   ISSN International Journal of Computer Engineering and Applications, Volume XII, Issue II, Feb. 18, www.ijcea.com ISSN 2321-3469 PERFORMANCE ANALYSIS OF CLASSIFICATION ALGORITHMS IN DATA MINING Srikanth Bethu

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 4, April 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Discovering Knowledge

More information

An Approach for Discretization and Feature Selection Of Continuous-Valued Attributes in Medical Images for Classification Learning.

An Approach for Discretization and Feature Selection Of Continuous-Valued Attributes in Medical Images for Classification Learning. An Approach for Discretization and Feature Selection Of Continuous-Valued Attributes in Medical Images for Classification Learning. Jaba Sheela L and Dr.V.Shanthi Abstract Many supervised machine learning

More information

Datasets Size: Effect on Clustering Results

Datasets Size: Effect on Clustering Results 1 Datasets Size: Effect on Clustering Results Adeleke Ajiboye 1, Ruzaini Abdullah Arshah 2, Hongwu Qin 3 Faculty of Computer Systems and Software Engineering Universiti Malaysia Pahang 1 {ajibraheem@live.com}

More information

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method. IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY IMPROVED ROUGH FUZZY POSSIBILISTIC C-MEANS (RFPCM) CLUSTERING ALGORITHM FOR MARKET DATA T.Buvana*, Dr.P.krishnakumari *Research

More information

Correlation Based Feature Selection with Irrelevant Feature Removal

Correlation Based Feature Selection with Irrelevant Feature Removal Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Implementation: Real machine learning schemes Decision trees Classification

More information