Defining a Data Mining Task. CSE3212 Data Mining. What to be mined? Or the Approaches. Task-relevant Data. Estimation.


Tracey Parks
5 years ago

CSE3212 Data Mining
Data Mining Approaches

Defining a Data Mining Task
To define a data mining task, one needs to answer the following questions:
1. What data set do I want to mine?
2. What kind of knowledge do I want to mine?
3. What background knowledge could be useful?
4. How do I measure whether the results are interesting?
5. How do I display what I have discovered?

Task-relevant Data
Generally we wish to mine only a subset of a database, not the whole database. We may want to study something specific, e.g. trends in postgraduate students:
- the countries they come from
- the degree program they are doing
- their age
- the time (duration) they have taken to finish the degree
- whether they have been awarded a scholarship
Building the database subset may be a subtask that has to be completed before data mining can be done.

What to be mined? Or the Approaches
What kind of knowledge are we after?
- Classification
- Estimation
- Prediction
- Clustering
- Description
- Affinity Grouping
- Outliers

Classification
Classification involves considering the features of some object and then assigning it to a pre-defined class, for example:
- spotting fraudulent insurance claims
- deciding which phone numbers are fax numbers
- identifying which customers are high-value
The features that are considered are known as the independent attributes or variables, while the attribute that constitutes the pre-defined classes is called the dependent attribute or variable.
First a model is built from the known data; the model is then used to classify other data for which the class label is not known. This is known as supervised learning.

Estimation
Estimation deals with numerically valued outcomes rather than the discrete categories that occur in classification, for example:
- estimating the number of children in a family
- estimating family income

Prediction
Essentially the same as classification and estimation, but involves future behaviour.
- Historical data is used to build a model explaining behaviour (outputs) for known inputs.
- The model is then applied to current inputs to predict future outputs.
Examples:
- predicting which customers will respond to a promotion
- classifying loan applications

Clustering
- Clustering is sometimes referred to as segmentation (though this term has other meanings in other fields).
- In clustering there are no pre-defined classes; self-similarity is used to group records.
- The user must attach meaning to the clusters formed.
- Clustering often precedes some other data mining task. For example, once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster.
- Clustering is known as unsupervised learning.

Description
- A good description of data can provide understanding of behaviour.
- The description of the behaviour can suggest an explanation for it as well.
- Statistical measures can be useful in describing data, as can techniques that generate rules.

Deviation Detection
- Records whose attributes deviate from the norm by significant amounts are called outliers.
- Application areas include fraud detection, quality control and tracing defects.
- Visualization techniques and statistical techniques are useful in finding outliers.
- A cluster which contains only a few records may in fact represent outliers.

Affinity Grouping
Affinity grouping is also referred to as Market Basket Analysis. A common example is the discovery of which items are frequently sold together at a supermarket. If this is known, decisions can be made about:
- arranging items on shelves
- which items should be promoted together
- which items should not be discounted simultaneously

Market Basket Analysis
"When a customer buys a shirt, in 70% of cases he or she will also buy a tie. We find this happens in 13.5% of all purchases."
- Rule body: a customer buys a shirt
- Rule head: he or she will also buy a tie
- Confidence: 70%
- Support: 13.5%

The Usefulness of Market Basket Analysis
- Some rules are useful: unknown, unexpected and indicative of some action to take.
- Some rules are trivial: known by anyone familiar with the business.
- Some rules are inexplicable: they seem to have no explanation and do not suggest a course of action.
"The key to success in business is to know something that nobody else knows." (Aristotle Onassis)

Co-Occurrence Table
Customer   Items
1          orange juice (OJ), cola
2          milk, orange juice, window cleaner
3          orange juice, detergent
4          orange juice, detergent, cola
5          window cleaner, cola

Each cell counts the customers whose basket contains both items; the diagonal counts the customers who bought the item at all.

            OJ   Cleaner   Milk   Cola   Detergent
OJ           4      1        1      2        2
Cleaner      1      2        1      1        0
Milk         1      1        1      0        0
Cola         2      1        0      3        1
Detergent    2      0        0      1        2

From the Co-Occurrence Table
We can say that people who buy orange juice also tend to buy cola (or detergent):
orange juice ⇒ cola
This association rule is satisfied by 2 out of 5 customers (1 and 4), hence its support is 2/5 = 40%. However, four customers (1, 2, 3 and 4) have purchased orange juice, hence the confidence of the rule is only 2/4 = 50%.
Questions:
- Are the support and confidence measures good enough?
- The rule has one item (or attribute) on the left-hand side and one on the right-hand side. How do you find rules that have more than one item on the left-hand side (multi-attribute rules)?

Support and Confidence
Support: the percentage of transactions in a transaction database that the given rule satisfies. This can be taken as the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of item sets X and Y.
Confidence: assesses the degree of certainty of the detected association. This can be taken as the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y.
More formally:
Support(X ⇒ Y) = P(X ∪ Y)
Confidence(X ⇒ Y) = P(Y | X)

What is a Rule?
If condition then result
Note: "If nappies and Thursday then beer" is usually better (in the sense that it is more actionable) than "If Thursday then nappies and beer", because it has just one item in the result.

Is the Rule a Useful Predictor? - 1
Confidence is the ratio of the number of transactions with all the items in the rule to the number of transactions with just the items in the condition.
Consider: if B and C then A. If this rule has a confidence of 0.33, it means that when B and C occur in a transaction, there is a 33% chance that A also occurs.
If a 3-way combination is the most common, then consider rules with just 1 item in the result, e.g.
- If A and B, then C
- If A and C, then B
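The support and confidence definitions can be checked against the five customer baskets of the co-occurrence example. The following is only a minimal sketch: the helper names count, support and confidence are illustrative assumptions, not part of the original slides. It also shows that the same counting works for a rule with more than one item in the condition, without building a higher-dimensional table.

```python
# The five customer baskets from the co-occurrence example.
transactions = {
    1: {"OJ", "Cola"},
    2: {"Milk", "OJ", "Cleaner"},
    3: {"OJ", "Detergent"},
    4: {"OJ", "Detergent", "Cola"},
    5: {"Cleaner", "Cola"},
}
items = ["OJ", "Cleaner", "Milk", "Cola", "Detergent"]


def count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in transactions.values() if set(itemset) <= basket)


def support(condition, result):
    """Fraction of all baskets containing both condition and result."""
    return count(set(condition) | set(result)) / len(transactions)


def confidence(condition, result):
    """Fraction of baskets containing the condition that also contain the result."""
    return count(set(condition) | set(result)) / count(condition)


# Symmetric co-occurrence table; the diagonal holds single-item counts.
cooccurrence = {(a, b): count({a, b}) for a in items for b in items}

print(support({"OJ"}, {"Cola"}))                   # 0.4
print(confidence({"OJ"}, {"Cola"}))                # 0.5
print(confidence({"OJ", "Detergent"}, {"Cola"}))   # 0.5 (two items in the condition)
```

Counting subsets directly scales poorly with many items and transactions, which is one reason a dedicated algorithm is needed for larger problems.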

Is the Rule a Useful Predictor? - 2
Consider the following table of probabilities of items and their combinations:

Combination      Probability
A                0.45
B                0.42
C                0.40
A and B          0.25
A and C          0.20
B and C          0.15
A and B and C    0.05

Is the Rule a Useful Predictor? - 3
Now consider the following rules:

Rule                 p(condition)   p(condition and result)   confidence
If A and B then C    0.25           0.05                      0.20
If A and C then B    0.20           0.05                      0.25
If B and C then A    0.15           0.05                      0.33

It is tempting to choose "If B and C then A", because it is the most confident (33%), but there is a problem.

Is the Rule a Useful Predictor? - 4
This rule is actually worse than just saying that A randomly occurs in the transaction, which happens 45% of the time. A measure called improvement indicates whether the rule predicts the result better than just assuming the result in the first place.

Is the Rule a Useful Predictor? - 5
Improvement measures how much better a rule is at predicting a result than just assuming the result in the first place. When improvement > 1, the rule is better at predicting the result than random chance.
Improvement = p(condition and result) / (p(condition) × p(result))

Is the Rule a Useful Predictor? - 6
Consider the improvement for our rules:

Rule                 support   confidence   improvement
If A and B then C    0.05      0.20         0.50
If A and C then B    0.05      0.25         0.60
If B and C then A    0.05      0.33         0.74
If A then B          0.25      0.56         1.32

None of the rules with three items shows any improvement. The best rule in the data actually has only two items: if A then B. A predicts the occurrence of B about 1.3 times better than chance.

Is the Rule a Useful Predictor? - 7
When improvement < 1, negating the result produces a better rule. For example, "if B and C then not A" has a confidence of 0.67 and thus an improvement of 0.67/0.55 = 1.22. Negated rules may not be as useful as the original association rules when it comes to acting on the results.
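The improvement calculation can be reproduced from the probability table. This is a minimal sketch under stated assumptions: the dictionary p and the function rule_measures are illustrative names, and the only inputs are the seven probabilities listed for A, B, C and their combinations.

```python
# Probabilities of the items and their combinations, as listed in the table.
p = {
    frozenset("A"): 0.45,
    frozenset("B"): 0.42,
    frozenset("C"): 0.40,
    frozenset("AB"): 0.25,
    frozenset("AC"): 0.20,
    frozenset("BC"): 0.15,
    frozenset("ABC"): 0.05,
}


def rule_measures(condition, result):
    """Return (support, confidence, improvement) for 'if condition then result'.

    condition and result are strings of item letters, e.g. "BC" and "A".
    """
    cond = frozenset(condition)
    res = frozenset(result)
    support = p[cond | res]              # p(condition and result)
    confidence = support / p[cond]       # p(result | condition)
    improvement = confidence / p[res]    # equals support / (p(cond) * p(res))
    return support, confidence, improvement


for condition, result in [("AB", "C"), ("AC", "B"), ("BC", "A"), ("A", "B")]:
    s, c, i = rule_measures(condition, result)
    print(f"If {' and '.join(condition)} then {result}: "
          f"support={s:.2f} confidence={c:.2f} improvement={i:.2f}")
```

Only "if A then B" comes out with an improvement above 1, which matches the conclusion that none of the three-item rules beats simply assuming the result.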

Choosing the Right Set of Items
- Choosing the right level of detail (the creation of classes and a taxonomy)
- Virtual items may be added to take advantage of information that goes beyond the taxonomy
- Anonymous versus signed transactions

Multi-attribute Rule
A rule with 2 items on the left-hand side and one item on the right-hand side (e.g. If A and B then C) would require the co-occurrence matrix to be 3-dimensional.
- How do you visualise a three-dimensional co-occurrence matrix?
- What happens for higher dimensions?

The Process for Market Basket Analysis
A co-occurrence cube would show associations in three dimensions; anything beyond that is hard to visualise. We must:
- choose the right set of items
- generate rules by deciphering the counts in the co-occurrence matrix
- overcome the practical limits imposed by many items in large numbers of transactions

An Example
Consider the following database:
Student(sid, name1, dob, country, degree, startsem, address1, telephone, address2, , scholarship,..)
Enrolment(sid, subject-id, mark, tutegroup, tutor,..)
Subject(sub-id, name, school-id, whenstarted, lecturer,..)
School(name, id,..)
Not all of this data is needed for decision making. Let us extract some data from this database.

Example
We could look at the information as
yob X country X degree X startsem X numsubjects X scholarship
In fact it is natural to think of enterprise data as multidimensional.

yob, country, degree, startsem, numsubjects, scholarship
1965, Thailand, MIT, 991, 5, 25%
1970, Canada, BIT, 992, 4,
, Australia, LLB, 993, 3, 30%
1966, Australia, LLB, 983, 4, 40%
1972, Australia, Bcom, 973, 5, 10%
1972, India, BIT/Bcom, 991, 5, 10%
1982, Sweden, MSc(IT), 991, 3, 10%

Is this information, in this raw form, useful for decision making? Not really!

Example
The university management may be interested in retrieving information like:
Queries involving only one variable:
- How many students are doing BIT?
- How many students are from Thailand?
- How many students started in 1998?
Queries involving two variables:
- How many students doing BIT are from Thailand?
- How many MIT students started in 981?
- How many students from Thailand started in 993?
Query involving three variables:
- How many students doing MIT from Thailand started in 981?
Special types of database systems, called data cube systems, are often used for answering such queries.

Example
The example queries discussed earlier may be represented by a three-dimensional data cube, with each edge representing one of the variables, viz. startsem, country and degree. A point inside the cube is an intersection of the coordinates defined by the edges of the cube. The coordinates of the point define the meaning of the data at that point.

Let us look at a simple two-dimensional situation: country X degree. For decision making this may be useful information. If we had a 2-dimensional matrix, we could find the number of students for any country (x) and any degree (y).

But in the two-dimensional situation, we don't just want to find the number of students for any country (x) and any degree (y). We may have many other queries, e.g.
1. How many students are doing MIT?
2. How many students are from Thailand?
3. How many Asian students are doing Law degrees?
Thus there is a kind of hierarchy that we wish to use, for example the world, the continents, the regions, the countries, etc. For degrees, we may want a hierarchy of university, schools, UG and PG, and individual degrees.

Consider a slightly more complex situation in which we have three dimensions, country X degree X startsem, giving the number of students for any country (x), any degree (y) and any start semester (z). We may now look at this information as a 3-dimensional cube, as shown on the following slide.
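The one-, two- and three-variable queries above amount to counting records grouped by a chosen subset of dimensions. The sketch below illustrates that idea only; it is not how a data cube system is implemented. The records reuse the country, degree and startsem values of the extracted sample rows, and the aggregate helper is an assumed name.

```python
from collections import Counter

# (country, degree, startsem) values taken from the extracted sample rows.
records = [
    ("Thailand", "MIT", "991"),
    ("Canada", "BIT", "992"),
    ("Australia", "LLB", "993"),
    ("Australia", "LLB", "983"),
    ("Australia", "Bcom", "973"),
    ("India", "BIT/Bcom", "991"),
    ("Sweden", "MSc(IT)", "991"),
]
DIMENSIONS = ("country", "degree", "startsem")


def aggregate(rows, dims):
    """Count students grouped by the chosen dimensions (hypothetical helper)."""
    idx = [DIMENSIONS.index(d) for d in dims]
    return Counter(tuple(row[i] for i in idx) for row in rows)


by_degree = aggregate(records, ["degree"])                    # one variable
by_country_degree = aggregate(records, ["country", "degree"]) # two variables
cube = aggregate(records, DIMENSIONS)                         # three variables

print(by_degree[("LLB",)])                      # 2
print(by_country_degree[("Australia", "LLB")])  # 2
print(cube[("Thailand", "MIT", "991")])         # 1
```

Each call recomputes its counts from the raw records; a data cube system would typically precompute and store such aggregates so that queries along any edge are answered quickly.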

A Sample
[Figure: a data cube showing the number of students as a function of country, degree and semester. Dimensions: country, degree, sem. Hierarchical summarization paths: country, region, continent; degree, ug/pg, school; semester, year. Edge labels include the degrees LLB, BComp and MIT and the countries U.S.A, Malaysia and Australia, with a sum along each edge giving total enrolments.]

Each edge of the cube is called a dimension. A user normally has a number of different dimensions from which the given data may be analyzed. A user therefore has a multidimensional conceptual view of the data, which is represented by the cube. The points inside a cube provide aggregations. For example, a point may provide the number of students from Malaysia admitted to BComp in a particular year.

Strengths and Weaknesses
Strengths:
- Clear, understandable results
- Supports undirected data mining
- Works on variable-length data
- Is simple to understand
Weaknesses:
- Requires exponentially more computational effort as the problem size grows
- Suits items in transactions, but not all problems fit this description
- It can be difficult to determine the right set of items to analyse
- It does not handle rare items well; simply considering the level of support will exclude these items
We need an algorithm to find the association rules.

Outlier Analysis
Outlier analysis identifies data objects that do not comply with the general behaviour or model of the data. Outliers are often ignored, but in applications like fraud detection the outliers are the objects of interest.
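As one illustration of a statistical technique for spotting outliers, the sketch below flags values that lie more than a chosen number of standard deviations from the mean (a z-score test). This particular test is an assumption chosen for simplicity; the slides do not prescribe a method. The claim amounts and the threshold of 2.0 are made-up illustrative values.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical insurance claim amounts; one is far larger than the rest.
claims = [120, 135, 128, 142, 130, 125, 138, 131, 127, 980]
print(zscore_outliers(claims))  # [980]
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a lower threshold or a median-based measure may be more appropriate.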
