CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE

In this work, educational data mining is applied to qualitative data about students, and their performance is analysed using the C4.5 decision tree algorithm. The results indicate that students' performance is also influenced by qualitative data. Knowledge acquired in the form of a tree is easy for users to assimilate. The work looks into the higher-education domain of data mining in order to analyse students' performance. Decision tree induction is one of the common approaches for extracting knowledge from sets of feature-based examples. Using machine learning techniques, a tool can be developed that helps to predict [90] performance on the basis of data. Machine learning techniques have been applied successfully in fields such as medical science, pattern recognition, image recognition and various control applications [99]. In the machine learning method studied here, the learned function is represented by a decision tree.

Introduction

Data mining is commonly described as the process of discovering significant patterns in large volumes of data. It offers a great variety of techniques, methods and tools for thorough analysis of the available data in various fields [53]. The term is used by analogy with mining: just as rocks are mined for a valuable mineral, a large database is mined for valuable information. The term is, however, a misnomer: mining for gold in rocks is called gold mining and not rock mining, so by analogy data mining should have been called knowledge mining. Data mining is also known as knowledge discovery in databases (KDD), a term that describes a more complete process. Other terms that refer to data mining are data dredging, knowledge extraction and pattern discovery. In principle, data mining is not specific to one type of media or data and ought to be applicable to any kind of information repository. However, approaches may differ when applied to different types of data. Indeed, the challenges presented by different types of data vary significantly [72].

In higher education, the ability to predict a student's performance is very important for improving quality. Higher education institutions hold an ample amount of hidden knowledge that can be extracted. This knowledge can be any student-specific information, such as success rate, academic performance, dropout rate, course preference, subject specialization or placement success [99]. The quality of the students in a higher education institution is classified by their academic performance. Many factors influence students' performance, such as financial condition, living location, parents' qualification, and socio-economic, non-academic and academic factors. Various data mining techniques are useful for deriving hidden knowledge from these factors. The technique behind the extraction is the knowledge discovery process, which extracts knowledge from the available dataset and should create a knowledge base for the benefit of the institution [101]. The factors that describe student performance can be used to predict it. For prediction, a number of well-known data mining classification algorithms can be used, such as ID3, Simple CART, J48, NB Tree and C4.5. The model mainly focuses on the prediction accuracy of students' academic performance using two different datasets. The experimental model also shows that the student attributes considered are highly influential in predicting the results. The research work focuses on the development of data mining classification models for predicting students' performance in higher education. The work is carried out on a small dataset with a number of attributes in order to analyse the performance of the students. Feature selection has been an active area of research in the machine learning, statistics and data mining communities [99].

Various attribute selection methods exist to identify the attributes that have the greatest impact. In such an environment there is scope for research into the efficiency of machine learning techniques. Data mining combines tools from statistics and machine learning with database management. It can be defined as the process that, starting from apparently unstructured data, tries to extract knowledge and/or unknown interesting patterns. Machine learning algorithms are used during this process.

Machine learning

Machine learning (ML) is the science that evolved to develop algorithms and models. According to Arthur Samuel, machine learning gives computers the capability to learn without being explicitly programmed. Basically, machine learning is a set of tools for teaching computers. With the help of machine learning techniques, a tool can be constructed that automatically predicts [90] the suitability of a particular course for a student on the basis of data. Machine learning techniques have been applied successfully in fields such as medical science, pattern recognition, image recognition and various control applications [99]. In the machine learning method studied here, the learned function is represented by a decision tree.

From the artificial intelligence point of view, learning is central to human knowledge and intelligence. It is also essential for building intelligent machines. From the software engineering point of view, machine learning allows us to program computers by example, which can be easier than writing code in the traditional way. Machine learning has been applied to a vast number of problems beyond typical statistics problems, and it is often designed with different considerations than statistics (e.g., speed is often more important than accuracy). Machine learning methods work in two phases: training, in which a model is developed from a collection of training data, and testing, in which the model is used to make decisions about new test data. Data mining may apply machine learning techniques; it may also drive the advancement of machine learning techniques or algorithms [38]. To use a machine learning algorithm, one has to formulate the problem in the form the algorithm expects, usually a set of features.

Machine learning can be categorized into three classes [81]:
1. Supervised learning: learning for classification or concept learning, in which the training data are labeled with the appropriate response. Classification and regression are the most common types of supervised learning.
2. Unsupervised learning: given a collection of unlabeled data, the task is to analyse the data and discover patterns in it. Clustering and association are the common types of unsupervised learning.
3. Reinforcement learning: a robot or controller seeks to learn the optimal actions to take based on the outcomes of past actions.

From the data mining point of view, machine learning is a research area of computer science that has grown quickly owing to advances in data analysis research. ML has also found a place in the database industry, where it is used to extract valuable knowledge efficiently from large data stores. The problem most frequently studied by data mining and machine learning researchers is classification. It consists of predicting the value of a categorical attribute, i.e. the class, based on the values of the predicting attributes. There are different classification methods. From the data mining point of view, machine learning approaches can be categorized into two distinct groups: symbolic approaches and statistical approaches. Symbolic approaches perform inductive learning of symbolic descriptions, such as rules, decision trees or logical representations [81]. Statistical approaches follow pattern-recognition methods, including k-nearest neighbour, Bayesian classifiers, neural network learning and support vector machines.

Decision-making

People can use knowledge for decision-making. Classification and prediction are two common methods of data analysis; they can be used to describe significant classes and to predict future values. The decision tree is a very popular model for classification and prediction because it does not require any domain knowledge or parameter setting [38]. The decision tree method is widely used because of its high accuracy in classifying data sets [11]. A decision tree is used for classification as follows. Suppose a tuple X is to be assigned a class label. The attribute values of the tuple are tested against the decision tree.
A path is traced from the root to a leaf node, which holds the class prediction for that tuple [38]. Decision tree induction uses a top-down approach. The root node plays the main role: starting from the root, each node is split recursively according to the algorithm. The tree is generated from a training set of tuples, and the resulting leaf nodes are associated with class labels. The commonly used algorithms for building a decision tree are ID3, C4.5 and CART. Different tools are available for implementing data mining classification techniques, such as RapidMiner, WEKA and TANAGRA. These tools also help to predict students' future academic performance. Therefore, an illustrative algorithm is used here for one of the most common machine learning techniques, namely decision trees [38].

This work uses the C4.5 decision tree learning method, which is suitable for discrete-valued functions. C4.5 is a successor to ID3 [75] and one of the most commonly used machine learning algorithms. It handles discrete values to build a decision tree. For continuous attributes, C4.5 divides the attribute values into two partitions around a threshold: values above the threshold go to one child and the rest go to the other child. It also handles missing attribute values. For attribute selection, C4.5 uses the gain ratio, which corrects the bias of information gain towards attributes with many outcome values [82]. In ID3, the partitioning process at each level uses a statistical property known as information gain, which determines the best attribute for the training set. C4.5, the successor of ID3, uses an extension of information gain known as the gain ratio [38]. As mentioned above, decision tree induction is a top-down approach; the difficulty is choosing the attribute to split on at each node. The best split divides the target class into the purest possible child nodes. The measure of this purity is called information, and the gain is the reduction in the amount of information required after the split.

Gain Ratio: The process of selecting a new attribute and subdividing the training examples is repeated for each non-terminal successor node. Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree [11]. The gain ratio is used for attribute selection, but before the gain ratio can be calculated, the split information has to be calculated. For an attribute A, the information gain Gain(S, A) relative to a collection of examples S is defined by the following equations; the resulting values are given in Table 3.3.

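\[ \mathrm{Info}(D) = -\sum_{i=1}^{n} p_i \log_2(p_i) \qquad (1) \] [38]

\[ \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \qquad (2) \] [38]

\[ \mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \qquad (3) \] [38]

So the gain ratio is

\[ \mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInformation}(S, A)} \qquad (4) \]

Here p_i is the proportion of tuples in D that belong to class i, and Info_A(D) is the information that remains after partitioning D on A: the sum, over the values of A, of the fraction of tuples taking that value multiplied by the information of that partition. A is a categorical attribute, and S_i is the subset of S for which A takes its i-th value. Using A, the split information SplitInformation(S, A) can be calculated; it is shown in Table 3.4. The calculated gain values are displayed in Table 3.3.

Illustrative example

The data set values for this work are listed in Table 3.1. The attributes are parentq (parent qualification), loc (living location), grade (the grade or division the student obtained in the previous class passed) and suitable (the decision label). The decision label has two values, yes or no, indicating whether the student is suitable for admission to a computer course. The grade attribute is calculated from the percentage: if perc >= 60% then grade = First; if perc >= 50% then grade = Second; otherwise grade = Third. The data set contains 150 records. The example set is illustrated in Figure 3.1. From equation (1), the information of the decision attribute suitable is calculated, where 110 records are yes and 27 are no:

\[ \mathrm{Info}(D) = -\frac{110}{138}\log_2\frac{110}{138} - \frac{27}{138}\log_2\frac{27}{138} = 0.7207 \]

Table 3.1: List of attributes
Attribute    Values
parentq      {Educated, Uneducated}
loc          {urban, rural}
grade        {First, Second, Third}
suitable     {yes, no}

The information is calculated similarly for all attributes. The parent attribute has two classes, educated and uneducated. For the educated class the information is

\[ \mathrm{Info}(\text{educated}) = \frac{58}{138}\left(-\frac{53}{58}\log_2\frac{53}{58} - \frac{5}{58}\log_2\frac{5}{58}\right) = 0.1627 \]

For the uneducated class the information is

\[ \mathrm{Info}(\text{uneducated}) = \frac{79}{138}\left(-\frac{57}{79}\log_2\frac{57}{79} - \frac{22}{79}\log_2\frac{22}{79}\right) = 0.4885 \]

\[ \mathrm{Info}(\text{parent}) = \mathrm{Info}(\text{educated}) + \mathrm{Info}(\text{uneducated}) = 0.1627 + 0.4885 = 0.6512 \]

The location attribute has two values, urban and rural. For these the information is

\[ \mathrm{Info}(\text{urban}) = \frac{64}{138}\left(-\frac{44}{64}\log_2\frac{44}{64} - \frac{20}{64}\log_2\frac{20}{64}\right) = 0.4155 \]

\[ \mathrm{Info}(\text{rural}) = \frac{73}{138}\left(-\frac{66}{73}\log_2\frac{66}{73} - \frac{7}{73}\log_2\frac{7}{73}\right) = 0.2411 \]

\[ \mathrm{Info}(\text{loc}) = \mathrm{Info}(\text{urban}) + \mathrm{Info}(\text{rural}) = 0.4155 + 0.2411 = 0.6567 \]

Figure 3.1: Example set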
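The weighted information of an attribute, as computed by hand above for loc, can be sketched in a few lines of Python. This is only an illustration and not the procedure WEKA uses internally; the function names `entropy` and `info_attribute` are made up for this example, and the class counts and the denominator 138 are taken directly from the formulas in the text.

```python
from math import log2

def entropy(counts):
    """Information (in bits) of a class distribution given as raw counts, as in equation (1)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_attribute(partitions, n_total):
    """Weighted information after a split: each partition's entropy times its share of n_total."""
    return sum((sum(p) / n_total) * entropy(p) for p in partitions)

# (yes, no) counts for each value of loc, as quoted in the text
loc_partitions = {"urban": (44, 20), "rural": (66, 7)}
N_TOTAL = 138  # denominator used in the text's formulas

info_loc = info_attribute(loc_partitions.values(), N_TOTAL)
print(f"Info(loc) = {info_loc:.4f}")
```

Run on the counts quoted for loc, this should agree with the hand-calculated 0.6567 up to rounding. The same two functions apply to any other attribute once its per-value class counts are known.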

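The grade attribute has three classes: first, second and third. The information is calculated for all classes as above:

\[ \mathrm{Info}(\text{grade}) = \mathrm{Info}(\text{first}) + \mathrm{Info}(\text{second}) + \mathrm{Info}(\text{third}) \]
\[ = \frac{66}{138}\left(-\frac{63}{66}\log_2\frac{63}{66} - \frac{6}{66}\log_2\frac{6}{66}\right) + \frac{43}{138}\left(-\frac{25}{43}\log_2\frac{25}{43} - \frac{18}{43}\log_2\frac{18}{43}\right) + \frac{28}{138}\left(-\frac{22}{28}\log_2\frac{22}{28} - \frac{6}{28}\log_2\frac{6}{28}\right) = 0.5853 \]

The consolidated information values are given in Table 3.2.

Table 3.2: Information Gain
Info(A)           Value
Info(parentq)     0.6512
Info(location)    0.6567
Info(grade)       0.5853

The gain values are obtained from equation (2). The gain values of all attributes are shown in Table 3.3.

\[ \mathrm{Gain}(\text{parentq}) = \mathrm{Info}(D) - \mathrm{Info}_{\text{parentq}}(D) = 0.7207 - 0.6512 = 0.0695 \]
\[ \mathrm{Gain}(\text{location}) = \mathrm{Info}(D) - \mathrm{Info}_{\text{location}}(D) = 0.7207 - 0.6567 = 0.0640 \]
\[ \mathrm{Gain}(\text{grade}) = \mathrm{Info}(D) - \mathrm{Info}_{\text{grade}}(D) = 0.7207 - 0.5853 = 0.1354 \]

Table 3.3: Gain Value
Gain               Value
Gain(parentq)      0.0695
Gain(location)     0.0640
Gain(grade)        0.1354

The split information of the decision attribute is calculated from equation (3). Table 3.4 lists the split information for the three attributes.

\[ \mathrm{Split}(\text{suitable}, \text{parent}) = -\frac{59}{138}\log_2\frac{59}{138} - \frac{79}{138}\log_2\frac{79}{138} \]
\[ \mathrm{Split}(\text{suitable}, \text{location}) = -\frac{73}{138}\log_2\frac{73}{138} - \frac{64}{138}\log_2\frac{64}{138} \]
\[ \mathrm{Split}(\text{suitable}, \text{grade}) = -\frac{66}{138}\log_2\frac{66}{138} - \frac{43}{138}\log_2\frac{43}{138} - \frac{28}{138}\log_2\frac{28}{138} \]

Table 3.4: Split information of the sample (rows: Split(S, parentq), Split(S, location), Split(S, grade))

From equation (4), the gain ratios are

\[ \mathrm{GainRatio}(\text{suitable}, \text{parentq}) = \frac{0.0695}{\mathrm{Split}(\text{suitable}, \text{parent})} \]
\[ \mathrm{GainRatio}(\text{suitable}, \text{location}) = \frac{0.0640}{\mathrm{Split}(\text{suitable}, \text{location})} \]
\[ \mathrm{GainRatio}(\text{suitable}, \text{grade}) = \frac{0.1354}{\mathrm{Split}(\text{suitable}, \text{grade})} \]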
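The split information and gain ratio expressions above can be evaluated with a short Python sketch. It is a rough illustration under stated assumptions: the helper names `split_information` and `gain_ratio` are invented for this example, the partition sizes and the denominator 138 are copied from the expressions in the text, and the gain values are the ones quoted in the gain ratio formulas. The printed numbers are whatever these inputs produce and are not claimed to be the values of the original tables.

```python
from math import log2

def split_information(sizes, n_total):
    """Equation (3): information of the partition sizes themselves, ignoring class labels."""
    return -sum((s / n_total) * log2(s / n_total) for s in sizes if s > 0)

def gain_ratio(gain, sizes, n_total):
    """Equation (4): information gain normalised by split information."""
    return gain / split_information(sizes, n_total)

N_TOTAL = 138  # denominator used in the text's expressions

# Gain values and partition sizes as quoted in the text
gains = {"parentq": 0.0695, "location": 0.0640, "grade": 0.1354}
sizes = {"parentq": (59, 79), "location": (73, 64), "grade": (66, 43, 28)}

ratios = {attr: gain_ratio(gains[attr], sizes[attr], N_TOTAL) for attr in gains}
best = max(ratios, key=ratios.get)

for attr, ratio in ratios.items():
    print(f"GainRatio(suitable, {attr}) = {ratio:.4f}")
print("attribute with the highest gain ratio:", best)
```

With these inputs the attribute with the highest gain ratio is grade, which is consistent with the choice of root node described in the text.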

Table 3.5: Gain Ratio (rows: GainRatio(S, parentq), GainRatio(S, location), GainRatio(S, grade))

The gain ratios are shown in Table 3.5. The grade attribute has the highest gain ratio; therefore it is selected as the root node of the tree. For the sample data, C4.5 splits the data table on the value of the students' grade. The above process is then repeated to select further nodes until a decision node is reached [11].

Steps involved in modeling

Extracting, or mining, knowledge from a data set is known as data mining. The steps for extracting educational knowledge using data mining techniques are described below and shown in Figure 3.2. Data quality problems are addressed by data cleaning, and the main solution approaches are outlined here. Real-world data collected for mining tend to be unclean: they may be noisy, inconsistent and incomplete [78].

Select Data -> Data Refinement -> Data Modeling -> Evaluation -> Deployment

Figure 3.2: Steps for extracting knowledge

Data Refinement: The data refining process refines dissimilar data to increase the understanding of the data; it removes data inconsistency and redundancy. The data refinement process can develop an integrated data resource [38]. When multiple data sources are integrated, for example in data warehouses, federated database systems or web-based information systems, the need for data cleaning increases significantly, because the sources often contain redundant data in different representations. Depending on the database or data warehousing implementation, the data refining process may be completed after the different datasets have been integrated. Inconsistent data are the raw material; the integrated data resource is the final product.
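A minimal Python sketch of such a refinement step is given below, assuming student records are held as dictionaries. It drops incomplete records, removes duplicates and derives the grade attribute from the percentage, using the 60% and 50% thresholds stated in the illustrative example of this chapter. The field name `student_id`, the sample records and the function names are hypothetical; the chapter itself reports using Excel for this step.

```python
REQUIRED = ("student_id", "parentq", "loc", "percent")  # hypothetical field names

def to_grade(percent):
    """Map a percentage to a grade using the thresholds stated in the chapter."""
    if percent >= 60:
        return "First"
    if percent >= 50:
        return "Second"
    return "Third"

def refine(records):
    """Drop incomplete records and duplicate students, then derive the grade attribute."""
    seen_ids = set()
    refined = []
    for rec in records:
        # Data cleaning: skip records with a missing or empty required field
        if any(rec.get(field) in (None, "") for field in REQUIRED):
            continue
        # Data cleaning: skip redundant copies of the same student
        if rec["student_id"] in seen_ids:
            continue
        seen_ids.add(rec["student_id"])
        # Data transformation: replace the raw percentage with a categorical grade
        refined.append({
            "parentq": rec["parentq"],
            "loc": rec["loc"],
            "grade": to_grade(rec["percent"]),
        })
    return refined

# Made-up sample records for illustration only
raw = [
    {"student_id": 1, "parentq": "Educated", "loc": "urban", "percent": 72.5},
    {"student_id": 1, "parentq": "Educated", "loc": "urban", "percent": 72.5},
    {"student_id": 2, "parentq": "Uneducated", "loc": "", "percent": 55.0},
    {"student_id": 3, "parentq": "Uneducated", "loc": "rural", "percent": 48.0},
]
refined = refine(raw)
print(refined)
```

Simply discarding incomplete records is the crudest possible treatment and can bias the sample, so a real pipeline would likely handle missing values more carefully.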

The refinement process may involve two steps: data cleaning and data transformation. Data cleaning is part of the process of integrating and transforming heterogeneous data sources. It raises the data quality to the level necessary for analysis, and it deals with incomplete, missing and non-existent values. Specifically, filtering the problematic data can introduce sample bias into the data, and using data overlays could introduce missing values [69]. Data warehouses load and continuously refresh huge amounts of data from a variety of sources, so there is a high probability that they contain unclean data [97]. Moreover, data warehouses support decision-making, so the correctness of the data is vital to avoid wrong conclusions. For instance, duplicated or missing information will produce incorrect or misleading statistics ("garbage in, garbage out"). Owing to the wide range of possible data inconsistencies and the sheer data volume, data cleaning is considered to be one of the biggest problems in data warehousing.

The collected data then need to be transformed into the required format. Data may be qualitative or quantitative, and converting data from one form to another is called data transformation. Some techniques require a specific form of data; therefore, a data preparation phase is needed. Data transformation includes data preparation operations such as the production of derived attributes, entirely new records, or transformed values for existing attributes [78]. In this work, Excel is used for the refinement process, and students' percentages are converted into grades according to the traditional method.

Data Modeling: Designing a model for extracting knowledge from a database is called data modeling. In the modeling phase, several modeling techniques are selected and applied. The purpose of data modeling is to recognize all the entities that the data contain and then to define the relationships between these entities. Data models can be conceptual, logical or physical. Conceptual data modeling typically identifies the highest-level relationships between different entities, whereas enterprise data modeling is similar to conceptual data modeling but addresses the unique requirements [53]. Logical data modeling illustrates the specific entities, attributes and relationships involved in a business function, and serves as the basis for the creation of the physical data model. Physical data modeling represents an application-specific and database-specific implementation of a logical data model. The first step in modeling is selecting the actual modeling technique to be used, e.g. building decision trees or generating a neural network. Prior to building a model, a procedure needs to be defined to test the model's quality and validity [53]. The main goal of modeling is consistency: when the model is applied to unseen data, it should predict the true value.

Evaluation: Before final deployment of the model, it is necessary to evaluate the model thoroughly. The steps executed to construct the model also need to be reviewed, to make sure that the model achieves the objectives. A key objective is to determine whether some important issue has not been considered sufficiently. At the end of the evaluation, the user should know whether the results serve the purpose of using data mining. The evaluation step deals with factors such as the accuracy and generality of the model [70]. It measures the degree to which the model achieves the objectives and determines the factors that may make the model deficient. For evaluating prediction, the confusion matrix is the best tool [53]. A confusion matrix, or classification matrix, is used to appraise the prediction accuracy of a model. It shows whether the model is making mistakes in its predictions and, if so, what percentage of them are wrong. Numerous classification rules are used to generate a confusion matrix [53]. Almost all performance metrics are expressed in terms of the elements of the confusion matrix generated by the model on a test sample. The format of a confusion matrix for a two-class case with classes yes and no is shown in Table 3.6. A column represents an actual class, while a row represents a predicted class.

The total number of instances in the test set is represented at the top of the table (P = total number of positive instances, N = total number of negative instances), while the number of instances predicted to belong to each class is represented to the left of the table (p = total number of instances classified as positive, n = total number of instances classified as negative). TP (true positives) is the number of correctly classified positive examples. In a similar manner, FN (false negatives) is the number of positive examples classified as negative, TN (true negatives) is the number of correctly classified negative examples and, finally, FP (false positives) is the number of negative examples for which the positive class was predicted. The performance on the positive class is represented by the true positive rate, TPrate = TP/(TP+FN). The corresponding measure for the negative class is the true negative rate (TNrate), calculated as the number of negative examples correctly identified out of all negative examples: TNrate = TN/(TN+FP). It is also important to evaluate how many of the examples identified as belonging to a given class actually belong to that class. This is done with the positive and negative predictive values [53]. The positive predictive value (PPV) is PPV = TP/(TP+FP), while the negative predictive value (NPV) represents the number of negatives correctly identified out of all examples classified as negative: NPV = TN/(TN+FN). TPrate, TNrate, PPV and NPV measure correct outcomes, which need to be maximized; sometimes their complements are of more interest. Together, these parameters provide a more exact view of the performance of a classification method. Depending on the objective, one may select some of these measurements and focus on those alone, or provide a composite metric that serves the given objective best.

Table 3.6: Confusion Matrix
                     Actual class
Predicted class      Yes (P)                 No (N)
Yes                  True Positive (TP)      False Positive (FP)
No                   False Negative (FN)     True Negative (TN)
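The four rates defined above follow directly from the cells of the two-class confusion matrix. The Python sketch below shows one way to compute them; the function name `confusion_metrics` and the example counts are made up for illustration and are not the results of this study. NPV is computed with its standard definition, TN/(TN+FN), and overall accuracy, (TP+TN) divided by the number of instances, is added as a convenience.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Rates derived from a two-class confusion matrix (positive class = yes)."""
    return {
        "TPrate": tp / (tp + fn),    # positives correctly identified, out of all actual positives
        "TNrate": tn / (tn + fp),    # negatives correctly identified, out of all actual negatives
        "PPV": tp / (tp + fp),       # correct among everything predicted positive
        "NPV": tn / (tn + fn),       # correct among everything predicted negative
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow
metrics = confusion_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The function assumes every denominator is non-zero; a matrix with an empty row or column would need explicit handling.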

The actual values in a confusion matrix are often represented as percentages. Whether or not a confusion matrix is good depends on the costs of misclassification [18]. The confusion matrix plays an important role in model building; it is calculated in a later section. The steps followed are:
1. For data collection, select an educational environment (for this task an educational organization, SSSSMV, Bhilai, C.G., was selected).
2. Select the relevant data for mining (data on admitted students were used).
3. Remove inconsistent or noisy data and treat incomplete and erroneous data.
4. Apply data transformation to the cleaned data, converting it into a new format.
5. Apply data mining and extract meaningful information from the training set (the decision tree technique is applied).
6. Evaluate the extracted information/result.

Tool used in experiment: WEKA

WEKA stands for Waikato Environment for Knowledge Analysis [36]. It is a computer program developed at the University of Waikato in New Zealand. WEKA is implemented in Java, which makes it portable and platform independent. The main purpose of developing WEKA was to identify information in data sets. The WEKA workbench is a collection of visualization tools and algorithms for data analysis and predictive modeling, and it provides good graphical user interfaces [61]. Qualitative data were collected for the experiment and 10-fold cross-validation was applied, with WEKA as the chosen tool [47]. WEKA supports the .csv data format, so the data were entered in Excel and saved in .csv format. WEKA can perform several standard data mining tasks, such as clustering, classification, data preprocessing, association, visualization and feature selection. WEKA's graphical environment provides the Explorer, Experimenter, Knowledge Flow and Simple CLI applications. The C4.5 algorithm yields an acceptable level of accuracy in WEKA. The decision tree generated by the WEKA tool is depicted in Figure 3.3.

Output

This chapter focuses on how the quality of higher education can be enhanced. As mentioned in this work, improvement is possible if the potential of students is recognized. For this work, students' qualitative data have been modeled using a decision tree, which is visualized in Figure 3.3. The students' grade is the root node, with the branches first, second and third. The root node was selected on the basis of the highest gain ratio. The decision tree shows that the course is suitable for all students with first grade. For second and third grade the same process is repeated; for second grade the highest information gain is for location, so students with second grade are examined on the basis of location. In the tree, parents' qualification and living location are taken as branch nodes, and so on. 10-fold cross-validation is applied, which is a way of reducing the variance of the estimate. With cross-validation, the data set is divided just once, into 10 pieces (folds). Nine of the pieces are used for training and the remaining piece is used for testing. The whole procedure is performed 10 times, each time using a different piece for testing. That is 10-fold cross-validation [26]. Table 3.7 shows the accuracy information of the model, and Table 3.8 shows the contingency table or confusion matrix. The number of correctly classified instances is the sum of the diagonal elements of the matrix (19+91=110); all remaining instances are incorrectly classified [62].
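The 10-fold procedure described above can be sketched in plain Python as follows. This is an illustrative outline only, not WEKA's implementation: the generator name `k_fold_indices`, the fixed seed and the use of 138 as the number of records are assumptions made for the example, and the sketch does not stratify the folds by class.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is used for testing exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # divide the data just once, in random order
    folds = [indices[i::k] for i in range(k)]  # k pieces of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(138, k=10))
for round_no, (train, test) in enumerate(splits, start=1):
    print(f"round {round_no}: {len(train)} training records, {len(test)} test records")
```

In each round a model would be trained on the `train` indices and evaluated on the `test` indices, and the per-round results would then be combined into a single confusion matrix or accuracy figure.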

Table 3.7: Accuracy Information of C4.5 (rows: Correctly Classified Instances, Incorrectly Classified Instances; columns: count and %)

Table 3.8 displays the confusion matrix for the tree shown in Figure 3.3. The true positive (TP) rate is the proportion of the examples of a class that were classified as that class. In the confusion matrix, this is the diagonal element divided by the sum over the relevant row, i.e. 19/(19+91) for class yes and 28/(0+28) = 1.0 for class no in our example. The true positive rates are shown in Table 3.9. The false positive (FP) rate is the proportion of examples that were classified as a class but belong to a different class. In the matrix, this is the column sum of the class minus the diagonal element, divided by the row sums of all other classes, i.e. 0/110 = 0.0 for class yes and 63/110 for class no, as shown in Table 3.9 [49].

Table 3.8: Confusion Matrix for C4.5 on the class attribute suitable (columns: Yes, No; the row for class No reads 0, 28)

Table 3.9: Class Accuracy (columns: Class label, TP Rate, FP Rate; rows: Yes, No)

Figure 3.3: Decision tree

3.6. Conclusion

Predicting students' academic performance is of great concern to the higher education system. Data mining can be used in a higher educational system to predict students' academic performance. This work presents a study that predicts students' performance for a particular course such as BCA, MCA or any other computer course. Students' qualitative data are used to show their influence on performance, using the decision tree machine learning technique. The study concludes that students' performance is affected by qualitative factors. Machine learning has come a long way from its early stages and can prove to be a powerful tool in academia. In the future, applications similar to the one developed here, as well as improvements on it, may become an integrated part of every academic institution. The success of any educational organization depends mainly on the results it produces in terms of student success rate [66]. This work derived a prediction mechanism for student success by course, social status and grade. The method proved to be effective, with approximately 94% of results predicted correctly. The method also helps college managements to improve their teaching-learning process and academic activities midway through the course in order to improve students' performance. In future, the technique can be improved by adding more qualitative data, such as hobbies, financial help, caste, attendance and sports ability. The technique can also be used in any educational organization or institution to predict the performance of students, so that they can improve their results and the dropout rate can be reduced. This work shows the accuracy of the model through the classification matrix, or confusion matrix. The accuracy table shows that the model accuracy is greater than 50%, which means that the model predicts the true value at a satisfactory rate.

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information


