CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)
- Allen Robbins
1 CMPUT 391 Database Management Systems Data Mining Textbook: Chapter (without 17.10) University of Alberta 1
2 Overview
- Motivation
- KDD and Data Mining
- Association Rules
- Clustering
- Classification
3 The Current Situation
Data rich:
- Technology is available to help us collect data: bar codes, scanners, satellites, cameras, etc.
- Technology is available to help us store data: databases, data warehouses, a variety of repositories.
Information poor:
- We need powerful data analysis processes to turn the data into information (competitive edge, research, etc.).
Example data sources: sales and service transactions, sky and Earth observation, molecular biology, medical data, ...
5 KDD: Knowledge Discovery in Databases
KDD is the process of non-trivial extraction of implicit, previously unknown, and potentially useful information from large collections of data.
It lies at the confluence of multiple disciplines: machine learning, statistics, database technology, visualization, ...
6 The KDD Process
Data mining is the core of the knowledge discovery process.
KDD is an iterative process.
[Figure: KDD pipeline with the stages Data Cleaning, Data Integration, Data Warehouse, Selection and Transformation, Task-relevant Data, Data Mining, Pattern Evaluation]
7 Basic Data Mining Functionalities
- Association rules: find frequent associations in transaction databases.
- Clustering: find natural groups in the data.
- Classification: find models to predict class membership for new objects.
- Concept characterization and discrimination: generalize, summarize, and contrast data characteristics.
- Other methods: outlier detection, sequential patterns, and methods for special data types, e.g., spatial data mining, web mining, text mining.
9 Basic Concepts
- $I = \{i_1, i_2, \ldots, i_n\}$ is the set of all possible items.
- A transaction is a set of items: $T = \{i_a, i_b, \ldots, i_t\}$, $T \subseteq I$.
- $D$ is a set (database) of transactions: the task-relevant data.
- An association rule has the form $P \Rightarrow Q$, where $P \subset I$, $Q \subset I$, and $P \cap Q = \emptyset$.
10 Rule Measures: Support and Confidence
- Support of a rule $P \Rightarrow Q$: $s_D(P \Rightarrow Q) = s_D(P \cup Q)$, the probability that a transaction in $D$ contains the items from both itemsets $P$ and $Q$.
- Confidence of a rule $P \Rightarrow Q$: $c_D(P \Rightarrow Q) = s_D(P \cup Q) / s_D(P)$, the probability that a transaction in $D$ that already contains the items in $P$ also contains the items in $Q$.
- Both measures are determined by counting the corresponding relative frequencies in the database.
[Figure: Venn diagram with the regions "Customer buys bread", "Customer buys milk", and "Customer buys both"]
11 Support and Confidence: An Example
Min. support 50%, min. confidence 50%.

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For rule $A \Rightarrow C$:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
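The counting behind these two numbers can be sketched in a few lines of Python. This is only an illustration of the definitions above; the function names `support` and `confidence` are chosen here and do not come from the slides.

```python
# Transactions from the example slide, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(p, q, db):
    """support(P union Q) / support(P) for the rule P => Q."""
    return support(set(p) | set(q), db) / support(p, db)

sup_ac = support({"A", "C"}, transactions)        # 2 of 4 transactions
conf_ac = confidence({"A"}, {"C"}, transactions)  # 0.5 / 0.75
print(sup_ac, conf_ac)
```

With the four transactions above, the rule A => C has support 0.5 and confidence 2/3, matching the slide.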
12 Strong Rules
Thresholds: minimum support (minsup) and minimum confidence (minconf).
Itemsets are called frequent if their support is greater than or equal to minsup.
Task: find all rules $P \Rightarrow Q$ $[s, c]$ such that $s_D(P \Rightarrow Q) \ge$ minsup and $c_D(P \Rightarrow Q) \ge$ minconf.
13 Mining Association Rules
Input: a database of transactions. Each transaction is a list of items (e.g., the items purchased by a customer in one visit).
Task: find all rules $P \Rightarrow Q$ $[s, c]$ that associate the presence of one set of items with that of another set of items, such that $s_D(P \Rightarrow Q) \ge$ minsup and $c_D(P \Rightarrow Q) \ge$ minconf.
Example: 98% of people who purchase tires and auto accessories also get automotive services done.
14 Mining Frequent Itemsets: the Key Step
- Iteratively find the frequent itemsets, i.e., sets of items that have minimum support, with cardinality from 1 to k (k-itemsets).
- Based on the Apriori principle: any subset of a frequent itemset must also be frequent. E.g., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
- Method based on the Apriori principle: when generating candidate k-itemsets for the next iteration, only consider those whose subsets of length k-1 have all been determined to be frequent in the previous step.
- Use the frequent itemsets to generate association rules.
15 The Apriori Algorithm
Join step: $C_k$ is generated by joining $L_{k-1}$ with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Pseudo-code:
    C_k: candidate itemsets of size k
    L_k: frequent itemsets of size k

    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    end
    return ∪_k L_k;
16 Candidate Generation: 1. Join Step
Suppose the items in each itemset of $L_{k-1}$ are listed in a fixed order.
Step 1: self-joining $L_{k-1}$
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Example: $p \in L_{k-1}$ = (1, 2, 3) and $q \in L_{k-1}$ = (1, 2, 4) are joined into (1, 2, 3, 4) $\in C_k$.
17 Candidate Generation: 2. Prune Step
Step 2: pruning
    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then delete c from C_k
Example:
- $L_3$ = {(1 2 3), (1 2 4), (1 3 4), (1 3 5), (2 3 4)}
- Candidates after the join step: {(1 2 3 4), (1 3 4 5)}
- In the pruning step: delete (1 3 4 5) because (3 4 5) $\notin L_3$
- $C_4$ = {(1 2 3 4)}
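The join and prune steps can be tried out on the $L_3$ example with a short Python sketch. The function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for this illustration.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    prev_frequent: a collection of sorted tuples, all of length k-1.
    """
    prev = set(prev_frequent)
    candidates = set()
    # Join step: merge two itemsets that agree on their first k-2 items.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune step: drop candidates that have an infrequent (k-1)-subset.
    return {
        c for c in candidates
        if all(s in prev for s in combinations(c, len(c) - 1))
    }

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = apriori_gen(L3)
print(C4)  # {(1, 2, 3, 4)}
```

The join produces (1, 2, 3, 4) and (1, 3, 4, 5); the second is pruned because its subset (1, 4, 5), like (3, 4, 5), is not in $L_3$.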
18 The Apriori Algorithm: Example
Database D (columns: TID, Items), minsup = 50%.

Scan D to count $C_1$:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

$L_1$:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

$C_2$ (generated from $L_1$): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count $C_2$:
itemset | sup
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

$L_2$:
itemset | sup
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

$C_3$: {2 3 5}

Scan D to obtain $L_3$:
itemset | sup
{2 3 5} | 2

Note: {1, 2, 3}, {1, 2, 5}, and {1, 3, 5} are not in $C_3$.
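A compact Python sketch of the whole level-wise loop is given below. The transaction rows of D are not legible in this transcription, so the four-transaction database `D` in the code is an assumption, chosen only because it reproduces every support count in the tables above. The function name `apriori` is likewise illustrative.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset (sorted tuple): absolute support count} for all frequent itemsets."""
    min_count = minsup * len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]  # C_1
    frequent = {}
    while candidates:
        # Scan the database once to count every candidate.
        counts = {c: sum(set(c) <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # L_k
        frequent.update(level)
        # Join L_k with itself, then prune by the Apriori principle.
        joined = {p + (q[-1],) for p in level for q in level
                  if p[:-1] == q[:-1] and p[-1] < q[-1]}
        candidates = sorted(
            c for c in joined
            if all(s in level for s in combinations(c, len(c) - 1))
        )
    return frequent

# Assumed database: consistent with the support counts on the slide.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, minsup=0.5)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)
```

On this database the sketch finds four frequent 1-itemsets, four frequent 2-itemsets, and the single frequent 3-itemset (2, 3, 5) with count 2.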
19 Generating Association Rules from Frequent Itemsets
Only strong association rules are generated.
Frequent itemsets satisfy the minimum support threshold.
Strong association rules also satisfy the minimum confidence threshold.
Confidence($P \Rightarrow Q$) = Prob($Q \mid P$) = Support($P \cup Q$) / Support($P$)
For each frequent itemset f, generate all non-empty proper subsets of f.
    for every non-empty proper subset s of f do
        output rule s ⇒ (f − s) if support(f) / support(s) ≥ min_confidence
    end
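The rule-generation loop can be sketched as follows, using the frequent itemsets and supports from the four-transaction example (Min. support 50%). The function name `generate_rules` and the dictionary layout are assumptions of this sketch.

```python
from itertools import combinations

# Relative supports of the frequent itemsets from the four-transaction example.
support = {
    frozenset("A"): 0.75,
    frozenset("B"): 0.50,
    frozenset("C"): 0.50,
    frozenset("AC"): 0.50,
}

def generate_rules(support, min_confidence):
    """Return (antecedent, consequent, confidence) for every strong rule."""
    rules = []
    for f in support:
        # Every non-empty proper subset s of f yields the rule s => (f - s).
        for size in range(1, len(f)):
            for s in map(frozenset, combinations(sorted(f), size)):
                conf = support[f] / support[s]
                if conf >= min_confidence:
                    rules.append((set(s), set(f - s), conf))
    return rules

rules = generate_rules(support, min_confidence=0.5)
for antecedent, consequent, conf in rules:
    print(antecedent, "=>", consequent, round(conf, 3))
```

Looking up `support[s]` is safe because, by the Apriori principle, every subset of a frequent itemset is itself frequent and therefore present in the table.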
20 Interestingness of Association Rules: Motivation
Example (Aggarwal & Yu, PODS98): among 5000 students,
- 3000 play basketball,
- 3750 eat cereal,
- 2000 both play basketball and eat cereal.
The rule play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
The rule play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

           | basketball | not basketball | sum(row)
cereal     | 2000 | 1750 | 3750
not cereal | 1000 | 250 | 1250
sum(col.)  | 3000 | 2000 | 5000
21 Interestingness of Association Rules
Delete misleading association rules.
Condition for a rule $A \Rightarrow B$: $\frac{P(A \wedge B)}{P(A)} > P(B) + d$ for a suitable threshold $d > 0$.
Measure for the interestingness of a rule: $\frac{P(A \wedge B)}{P(A)} - P(B)$.
The larger the value, the more interesting the relation between A and B expressed by the rule.
Other measures: correlation between A and B.
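Applying this measure to the basketball and cereal numbers shows why the first rule is misleading. The sketch below is illustrative; the function name `interest` is chosen here.

```python
n = 5000          # students
basketball = 3000
cereal = 3750
both = 2000       # play basketball and eat cereal

def interest(p_ab, p_a, p_b):
    """P(A and B) / P(A) - P(B): confidence of A => B minus the prior of B."""
    return p_ab / p_a - p_b

# play basketball => eat cereal
i_cereal = interest(both / n, basketball / n, cereal / n)
# play basketball => not eat cereal
i_no_cereal = interest((basketball - both) / n, basketball / n, (n - cereal) / n)

print(round(i_cereal, 4), round(i_no_cereal, 4))
```

The first rule gets a negative value (about -0.083) and the second a positive one (about +0.083), so only the second would pass the condition for a small threshold d.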
23 What is Clustering in Data Mining?
- Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
- It helps users understand the natural grouping or structure in a data set.
- Cluster: a collection of data objects that are similar to one another and thus can be treated collectively as one group.
- Clustering is unsupervised classification: there are no predefined classes.
24 What Is Good Clustering?
- A good clustering method will produce high-quality clusters in which the intra-class similarity (within a cluster) is high and the inter-class similarity (between clusters) is low.
- The quality of a clustering result also depends on the similarity measure used by the method.
- The quality of a clustering result also depends on the definition and representation of a cluster: different clustering algorithms may have different underlying notions of clusters.
25 Major Clustering Techniques
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion. There is an agglomerative approach and a divisive approach.
- Density-based: based on connectivity and density functions.
- Grid-based: based on a multiple-level granularity structure.
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model.
26 Partitioning Algorithms: Basic Concept
- Partitioning method: given a number k, partition a database D of n objects into a set of k clusters so that a chosen objective function is minimized (e.g., the sum of distances to the centers of the clusters).
- Global optimum: exhaustively enumerating all partitions is too expensive.
- Heuristic methods: based on iterative refinement of an initial partition.
[Figure: two panels, "Bad Clustering" and "Optimal Clustering", with the centroids marked by x]
27 The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps:
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid of a cluster for the k-means algorithm is the mean point of all points in the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
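The four steps can be sketched in plain Python as follows. Several details are assumptions of this sketch and not part of the slide: the round-robin initial partition, squared Euclidean distance, the `max_iter` safety limit, and keeping the previous centroid when a cluster becomes empty.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    # Step 1: initial partition into k non-empty subsets (round-robin, assumed).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: the centroid of each cluster is the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous centroid if a cluster became empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

On these six points the loop converges after two passes, separating the three points near (1, 1) from the three near (8, 8).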
28 Comments on the K-Means Method
Strength of k-means:
- Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
Weaknesses of k-means:
- Applicable only when a mean is defined, which leaves open how to handle categorical data.
- The number of clusters, k, must be specified in advance.
- Not suitable for discovering clusters with non-convex shapes.
- Often terminates at a local optimum.
29 Hierarchical Clustering
- Hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters.
- The result is represented by a so-called dendrogram. Nodes in the dendrogram represent possible clusters.
- The dendrogram can be constructed bottom-up (agglomerative approach) or top-down (divisive approach).
[Figure: objects a, b, c, d, e merged from Step 0 to Step 4 in the agglomerative direction into ab, de, cde, and abcde; the divisive direction runs the same steps in reverse]
30 Hierarchical Clustering: Example
Interpretation of the dendrogram:
- The root represents the whole data set.
- A leaf represents a single object in the data set.
- An internal node represents the union of all objects in its sub-tree.
- The height of an internal node represents the distance/similarity between its two child nodes.
[Figure: dendrogram with the vertical axis "distance between groups"]
31 Agglomerative Hierarchical Clustering
Single-link method and variants:
- Start by placing each object in its own cluster.
- Keep merging the closest (most similar) pairs of clusters into larger clusters until all objects are in a single cluster.
- Most hierarchical methods belong to this category. They differ mainly in their definition of between-cluster similarity.
- Single-link: similarity between two clusters is defined as the similarity between their closest (i.e., most similar) pair of objects.
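A naive Python sketch of single-link agglomerative clustering is shown below. The function name `single_link`, the one-dimensional sample data, and the brute-force search for the closest pair are assumptions made for illustration; the recorded merge distances correspond to the heights of the internal dendrogram nodes.

```python
def single_link(points, dist):
    """Merge clusters bottom-up; return the list of (left, right, distance) merges."""
    clusters = [[p] for p in points]  # every object starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: distance between the closest pair of objects.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

data = [1.0, 2.0, 6.0, 7.0, 15.0]
merges = single_link(data, dist=lambda a, b: abs(a - b))
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The five sample values are merged at distances 1, 1, 4, and 8, ending with a single cluster that contains all objects.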
33 What is Classification?
- The goal of data classification is to organize and categorize data into distinct classes.
- A model is first created based on the data distribution.
- The model is then used to classify new data.
- Given the model, a class can be predicted for new data.
34 Classification is a three-step process
1. Model construction (learning):
- Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
- The set of all tuples used for construction of the model is called the training set.
- The model can be represented in the following forms: classification rules (IF-THEN statements), a decision tree, or mathematical formulae.

Training set:
Outlook | Temperature | Humidity | Windy | Class
sunny | hot | high | false | N
sunny | hot | high | true | N
overcast | hot | high | false | P
35 1. Classification Process (Learning)
[Figure: Training Data are fed to Classification Algorithms, which produce the Classifier (Model)]

Training data:
Name | Income | Age | Credit rating
Bruce | Low | <30 | bad
Dave | Medium | [30..40] | good
William | High | <30 | good
Marie | Medium | >40 | good
Anne | Low | [30..40] | good
Chris | Medium | <30 | bad

Classifier (model): IF Income = High OR Age > 30 THEN CreditRating = Good
36 Classification is a three-step process
2. Model evaluation (accuracy):
- Estimate the accuracy rate of the model based on a test set.
- The known label of each test sample is compared with the classification result from the model.
- The accuracy rate is the percentage of test set samples that are correctly classified by the model.
- The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected.

Test set:
Outlook | Temperature | Humidity | Windy | Class | = Predicted Class?
rainy | hot | low | false | P | = P?
sunny | hot | high | true | N | = P?
overcast | hot | high | false | P | = N?
37 2. Classification Process (Accuracy Evaluation)
[Figure: Testing Data are fed to the Classifier (Model)]

Testing data:
Name | Income | Age | Credit rating
Tom | Medium | <30 | bad
Jane | High | <30 | bad
Wei | High | >40 | good
Hua | Medium | [30..40] | good

Classifier (model): IF Income = High OR Age > 30 THEN CreditRating = Good
How accurate is the model?
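One way to answer the question is to apply the rule to each test tuple and count the matches. Two points in the sketch below are assumptions that the slide does not settle: the age brackets [30..40] and >40 are both treated as satisfying "Age > 30", and a tuple that does not satisfy the rule is predicted as "bad".

```python
# Age is given as a bracket on the slides; which brackets count as "Age > 30"
# is an assumption made here so that the rule can be evaluated.
OVER_30 = {"[30..40]", ">40"}

def predict(income, age):
    # The slide states only the "good" rule; predicting "bad" otherwise is assumed.
    return "good" if income == "High" or age in OVER_30 else "bad"

test_set = [
    ("Tom", "Medium", "<30", "bad"),
    ("Jane", "High", "<30", "bad"),
    ("Wei", "High", ">40", "good"),
    ("Hua", "Medium", "[30..40]", "good"),
]

correct = sum(predict(income, age) == label for _, income, age, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # 0.75
```

Under these assumptions the rule misclassifies only Jane (high income, but a bad credit rating), giving an accuracy of 75% on the four test tuples.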
38 Classification is a three-step process
3. Model use (classification):
- The model is used to classify new objects whose class label is not known.
- Using the attributes of the new object and the model, assign a class label to the new object.

New data (predict the class):
Outlook | Temperature | Humidity | Windy | Class
rainy | hot | low | false | ?
rainy | mild | high | false | ?
overcast | hot | high | true | ?
39 3. Classification Process (Classification)
[Figure: New Data are fed to the Classifier (Model), which outputs the credit rating]

New data:
Name | Income | Age | Credit rating
Paul | High | [30..40] | ?
40 Classification Methods
- Decision Tree Induction
- Neural Networks
- Bayesian Classification
- Association-Based Classification
- K-Nearest Neighbour
- Genetic Algorithms
- Etc.
41 What is a Decision Tree?
A decision tree is a flow-chart-like tree structure.
- An internal node denotes a test on an attribute.
- A branch represents an outcome of the test. All tuples in a branch have the same value for the tested attribute.
- A leaf node represents a class label or a class label distribution.
[Figure: tree whose internal nodes are attribute tests (Atr=?) and whose leaves are class labels (CL)]
42 Training Dataset
An example from Quinlan's ID3:

Outlook | Temperature | Humidity | Windy | Class
sunny | hot | high | false | N
sunny | hot | high | true | N
overcast | hot | high | false | P
rain | mild | high | false | P
rain | cool | normal | false | P
rain | cool | normal | true | N
overcast | cool | normal | true | P
sunny | mild | high | false | N
sunny | cool | normal | false | P
rain | mild | normal | false | P
sunny | mild | normal | true | P
overcast | mild | high | true | P
overcast | hot | normal | false | P
rain | mild | high | true | N
43 A Sample Decision Tree
Decision tree for the training dataset of Quinlan's ID3 example (14 tuples with the attributes Outlook, Temperature, Humidity, Windy and the class label):

Outlook?
    sunny -> Humidity?
        high -> N
        normal -> P
    overcast -> P
    rain -> Windy?
        true -> N
        false -> P
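Classifying a sample with this tree amounts to following one path from the root to a leaf. In the Python sketch below, the nested tuple/dictionary encoding of the tree and the function name `classify` are assumptions made for illustration.

```python
# Internal node: (attribute, {value: subtree}); leaf: class label string.
tree = ("Outlook", {
    "sunny": ("Humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain": ("Windy", {"true": "N", "false": "P"}),
})

def classify(node, sample):
    """Follow the branch matching the sample's attribute value until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[sample[attribute]]
    return node

sample = {"Outlook": "rain", "Temperature": "mild", "Humidity": "high", "Windy": "false"}
print(classify(tree, sample))  # P
```

Temperature is never tested on any path, so its value does not influence the predicted class.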
44 Decision-Tree Classification Methods
The basic top-down decision tree generation approach usually consists of two phases:
1. Tree construction: at the start, all the training examples are at the root. The examples are partitioned recursively, based on selected attributes.
2. Tree pruning: aims at removing tree branches that may reflect noise in the training data and lead to errors when classifying test data, in order to improve classification accuracy.
45 Decision Tree Construction
Recursive process:
- The tree starts as a single node representing all the data.
- Recursion stops when:
  a) the samples in the node all belong to the same class;
  b) there are no remaining attributes on which to split;
  c) there are no samples with the attribute value.
  In cases a) and b), the node becomes a leaf labeled with the majority class label.
- Otherwise, select a suitable attribute and partition the data into subsets according to the values of the selected attribute. For each of these subsets, create a new child node under the current parent node and recursively apply the method to the new child nodes.
46 Partitioning the Data at a Given Node
- Split criterion: use a goodness/impurity function to determine the attribute that results in the purest subsets with respect to the class label. Different goodness functions exist: information gain, Gini index, etc.
- Branching scheme: binary splitting (numerical attributes, Gini index) versus multiway splitting (categorical attributes, information gain).
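Both goodness functions can be computed from class counts alone. The sketch below uses the standard definitions of entropy and the Gini index, which the slides name but do not spell out. The counts are read off the 14-tuple training dataset (9 P and 5 N overall), and the helper name `split_quality` is an assumption of this sketch.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini index of a class distribution given as a list of counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_quality(parent, children, impurity):
    """Impurity of the parent minus the weighted impurity of the child subsets."""
    total = sum(parent)
    weighted = sum(sum(child) / total * impurity(child) for child in children)
    return impurity(parent) - weighted

parent = [9, 5]                      # [P, N] counts over all 14 tuples
outlook = [[2, 3], [4, 0], [3, 2]]   # sunny, overcast, rain
windy = [[6, 2], [3, 3]]             # false, true

gain_outlook = split_quality(parent, outlook, entropy)
gain_windy = split_quality(parent, windy, entropy)
print(round(gain_outlook, 3), round(gain_windy, 3))
print(round(split_quality(parent, outlook, gini), 3))
```

With `entropy` as the impurity function, `split_quality` is the information gain; on this data Outlook scores about 0.247 and Windy about 0.048, so Outlook yields the purer subsets.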
47 Example for Algorithm (ID3)
All attributes are categorical.
    create a node N;
    if samples are all of the same class C then
        return N as a leaf node labeled with C;
    if attribute-list is empty then
        return N as a leaf node labeled with the most common class;
    select the split-attribute with the highest information gain;
    label N with the split-attribute;
    for each value A_i of the split-attribute, grow a branch from node N:
        let S_i be the branch in which all tuples have the value A_i for the split-attribute;
        if S_i is empty then
            attach a leaf labeled with the most common class;
        else
            recursively run the algorithm at node S_i;
    until all branches reach leaf nodes
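A minimal Python sketch of this procedure, run on the 14-tuple training dataset, is given below. It is a simplification under stated assumptions: branches are grown only for attribute values that actually occur in the current subset, so the "S_i is empty" case never arises, and the tree is encoded as nested (attribute, branches) tuples. All names are chosen for this sketch.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]
ROWS = """sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N""".splitlines()
DATA = [dict(zip(ATTRIBUTES + ["Class"], row.split())) for row in ROWS]

def entropy(rows):
    counts = Counter(r["Class"] for r in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())

def gain(rows, attr):
    """Information gain of splitting rows on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attributes):
    classes = Counter(r["Class"] for r in rows)
    # Leaf: all samples in one class, or no attributes left (majority class).
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a))
    rest = [a for a in attributes if a != best]
    # Only values present in rows get a branch, so no subset is empty.
    return (best, {v: id3([r for r in rows if r[best] == v], rest)
                   for v in sorted({r[best] for r in rows})})

tree = id3(DATA, ATTRIBUTES)
print(tree)
```

On this dataset the sketch selects Outlook at the root, then Humidity under sunny and Windy under rain, which reproduces the sample decision tree.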
48 How to use a tree?
- Directly: test the attribute values of an unknown sample against the tree. A path is traced from the root to a leaf, which holds the label.
- Indirectly: the decision tree is converted to classification rules. One rule is created for each path from the root to a leaf. IF-THEN rules are easier for humans to understand.
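The indirect use can be sketched by walking every root-to-leaf path and emitting one IF-THEN rule per path. The tree below is the sample decision tree for the weather data; its nested encoding, the rule text format, and the function name `tree_to_rules` are assumptions of this sketch.

```python
# Internal node: (attribute, {value: subtree}); leaf: class label string.
tree = ("Outlook", {
    "sunny": ("Humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain": ("Windy", {"true": "N", "false": "P"}),
})

def tree_to_rules(node, conditions=()):
    """Return one IF-THEN rule string for each path from the root to a leaf."""
    if isinstance(node, str):
        test = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {test} THEN Class = {node}"]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, conditions + ((attribute, value),))
    return rules

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

The five leaves of the sample tree give five rules, for example "IF Outlook = sunny AND Humidity = high THEN Class = N".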
What Is Data Mining? CMPT 354: Database I -- Data Mining 2
Data Mining What Is Data Mining? Mining data mining knowledge Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data CMPT
More informationAssociation Rules. Berlin Chen References:
Association Rules Berlin Chen 2005 References: 1. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 8 2. Data Mining: Concepts and Techniques, Chapter 6 Association Rules: Basic Concepts A
More informationSupervised and Unsupervised Learning (II)
Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised
More informationANU MLSS 2010: Data Mining. Part 2: Association rule mining
ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements
More informationData warehouse and Data Mining
Data warehouse and Data Mining Lecture No. 14 Data Mining and its techniques Naeem A. Mahoto Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology
More informationAssociation Rule Mining. Entscheidungsunterstützungssysteme
Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
More informationChapter 4 Data Mining A Short Introduction
Chapter 4 Data Mining A Short Introduction Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining 3. Clustering 4. Classification Data Mining - 2 2 1. Data Mining Overview
More informationClassification with Decision Tree Induction
Classification with Decision Tree Induction This algorithm makes Classification Decision for a test sample with the help of tree like structure (Similar to Binary Tree OR k-ary tree) Nodes in the tree
More informationChapter 4 Data Mining A Short Introduction. 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1
Chapter 4 Data Mining A Short Introduction 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationInterestingness Measurements
Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationBasic Data Mining Technique
Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm
More informationInterestingness Measurements
Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected
More informationApriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke
Apriori Algorithm For a given set of transactions, the main aim of Association Rule Mining is to find rules that will predict the occurrence of an item based on the occurrences of the other items in the
More informationChapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the
Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule
More informationLecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics
More informationCompSci 516 Data Intensive Computing Systems
CompSci 516 Data Intensive Computing Systems Lecture 20 Data Mining and Mining Association Rules Instructor: Sudeepa Roy CompSci 516: Data Intensive Computing Systems 1 Reading Material Optional Reading:
More informationData Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1
Data Mining: Concepts and Techniques Chapter 5 SS Chung April 5, 2013 Data Mining: Concepts and Techniques 1 Chapter 5: Mining Frequent Patterns, Association and Correlations Basic concepts and a road
More informationCSE 634/590 Data mining Extra Credit: Classification by Association rules: Example Problem. Muhammad Asiful Islam, SBID:
CSE 634/590 Data mining Extra Credit: Classification by Association rules: Example Problem Muhammad Asiful Islam, SBID: 106506983 Original Data Outlook Humidity Wind PlayTenis Sunny High Weak No Sunny
More informationUnsupervised: no target value to predict
Clustering Unsupervised: no target value to predict Differences between models/algorithms: Exclusive vs. overlapping Deterministic vs. probabilistic Hierarchical vs. flat Incremental vs. batch learning
More informationFrequent Pattern Mining
Frequent Pattern Mining How Many Words Is a Picture Worth? E. Aiden and J-B Michel: Uncharted. Reverhead Books, 2013 Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 2 Burnt or Burned? E. Aiden and J-B
More informationChapter 4: Mining Frequent Patterns, Associations and Correlations
Chapter 4: Mining Frequent Patterns, Associations and Correlations 4.1 Basic Concepts 4.2 Frequent Itemset Mining Methods 4.3 Which Patterns Are Interesting? Pattern Evaluation Methods 4.4 Summary Frequent
More informationCourse Content. Outline of Lecture 10. Objectives of Lecture 10 DBMS & WWW. CMPUT 499: DBMS and WWW. Dr. Osmar R. Zaïane. University of Alberta 4
Technologies and Applications Winter 2001 CMPUT 499: DBMS and WWW Dr. Osmar R. Zaïane Course Content Internet and WWW Protocols and beyond Animation & WWW Java Script Dynamic Pages Perl Intro. Java Applets
More informationData Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3
Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?
More informationIntroduction to Data Mining
Introduction to JULY 2011 Afsaneh Yazdani What motivated? Wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge What motivated? Data
More information9/6/14. Our first learning algorithm. Comp 135 Introduction to Machine Learning and Data Mining. knn Algorithm. knn Algorithm (simple form)
Comp 135 Introduction to Machine Learning and Data Mining Our first learning algorithm How would you classify the next example? Fall 2014 Professor: Roni Khardon Computer Science Tufts University o o o
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationProduct presentations can be more intelligently planned
Association Rules Lecture /DMBI/IKI8303T/MTI/UI Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id) Faculty of Computer Science, Objectives Introduction What is Association Mining? Mining Association Rules
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationData Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier
Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio
More informationThanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a
Data Mining and Information Retrieval Introduction to Data Mining Why Data Mining? Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently
More informationRule induction. Dr Beatriz de la Iglesia
Rule induction Dr Beatriz de la Iglesia email: b.iglesia@uea.ac.uk Outline What are rules? Rule Evaluation Classification rules Association rules 2 Rule induction (RI) As their name suggests, RI algorithms
More informationBCB 713 Module Spring 2011
Association Rule Mining COMP 790-90 Seminar BCB 713 Module Spring 2011 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Outline What is association rule mining? Methods for association rule mining Extensions
More informationCse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University
Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationAn Improved Apriori Algorithm for Association Rules
Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan
More informationData Mining Algorithms
for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester
More informationCOMP 465: Data Mining Classification Basics