An Introduction to Data Mining

Hossein Hakimzadeh
Computer and Information Sciences
Data Mining (B561)

What Is Data Mining?
Original definition: "data mining" was a statistician's term for overusing data to draw invalid inferences.
Bonferroni's theorem: If there are too many possible conclusions to draw, some will be true for purely statistical reasons, with no physical validity.

What Is Data Mining?
J. B. Rhine, a "parapsychologist" at Duke in the 1950s, tested students for "extrasensory perception" (ESP) by asking them to guess 10 cards as red or black.
He found that about 1 in 1000 of them guessed all 10 correctly, and he declared them to have ESP.
When he retested them, he found they did no better than average.
His conclusion: Telling people they have ESP causes them to lose it!
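The arithmetic behind this story can be checked directly. The sketch below assumes that every red/black guess is an independent 50/50 event and uses a hypothetical population of 1000 students; neither figure comes from the slide beyond the "about 1 in 1000" observation.

```python
from fractions import Fraction

# Assumption: each red/black guess is an independent 50/50 event.
p_all_correct = Fraction(1, 2) ** 10   # chance of guessing all 10 cards by luck

# Hypothetical population size, chosen only for illustration.
students = 1000
expected_lucky = students * p_all_correct  # expected perfect scorers by chance alone

print(p_all_correct)          # 1/1024
print(float(expected_lucky))  # 0.9765625
```

Under these assumptions roughly one student in a thousand scores perfectly by luck, and on a retest that student again has only a 1/1024 chance of repeating the feat, which matches what was observed without any need for ESP.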

What Is Data Mining?
Definition-1: "Discovery of useful summaries of data."
Data Mining Course at Stanford University http://www-db.stanford.edu/~ullman/mining/mining.html
Definition-2: The mining or discovery of new information in terms of patterns or rules from vast amounts of data.
Fundamentals of Database Systems, Elmasri and Navathe, 4th Edition, Addison Wesley.

Data Mining vs. Data Retrieval
Existing query tools can be likened to a flashlight used to locate interesting information in data: the user must point the flashlight wherever he or she expects to find useful trends and patterns.
Data mining discovers patterns that direct the user toward the right questions to ask with traditional query tools.
A data mining tool does not require the user to state a hypothesis in advance; it tries to discover relationships and hidden patterns that may not always be obvious.

Applications of Data Mining:
Some examples of "successes":
1. Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
3. "Diapers and beer." Customers who buy diapers are more likely to buy beer than average customers. This observation allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips in between increased the sales of all three items.

Applications of Data Mining:
Some examples of "successes":
4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
5. Comparison of the genotypes of people with and without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining will become much more important as the human genome is constructed.
Data Mining Course at Stanford University http://www-db.stanford.edu/~ullman/mining/mining.html

Data-Mining Communities:
Data mining has been claimed by a number of research communities:
  Statistics.
  Artificial Intelligence, where it is called "machine learning." Neural networks and genetic algorithms are also used.
  Researchers in clustering algorithms.
  Visualization researchers.
  Databases, where data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.
Data Mining Course at Stanford University http://www-db.stanford.edu/~ullman/mining/mining.html

Stages of the Data-Mining Process:
  Data gathering
  Data cleansing
  Feature extraction
  Pattern extraction and discovery
  Visualization of the data
  Evaluation of results

Stages of the Data-Mining Process:
  Data gathering: data warehousing, Web crawling.
  Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
  Feature extraction: keep only the interesting attributes of the data and remove useless ones, e.g., "date acquired" is probably not useful for clustering celestial objects, as in Skycat.
  Pattern extraction and discovery: this is the stage that is often thought of as "data mining".
  Visualization of the data.
  Evaluation of results: not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
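As a rough illustration of the cleansing stage, the sketch below filters out implausible readings such as the fever of 125 mentioned above. The records, the field names, and the plausibility range are all hypothetical.

```python
# Hypothetical patient records; the field names are illustrative only.
records = [
    {"patient": "A", "fever_f": 98.6},
    {"patient": "B", "fever_f": 125.0},   # implausible reading, as in the slide's example
    {"patient": "C", "fever_f": 101.2},
]

# Assumed plausibility range in degrees Fahrenheit; a real project would
# take these limits from domain experts rather than hard-code them.
MIN_TEMP_F, MAX_TEMP_F = 90.0, 110.0

def is_plausible(record):
    """True when the temperature falls inside the assumed plausible range."""
    return MIN_TEMP_F <= record["fever_f"] <= MAX_TEMP_F

clean = [r for r in records if is_plausible(r)]
rejected = [r for r in records if not is_plausible(r)]

print([r["patient"] for r in clean])     # ['A', 'C']
print([r["patient"] for r in rejected])  # ['B']
```

Keeping the rejected rows in a separate list, rather than silently dropping them, makes it possible to review whether the rule itself was too strict.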

How is Knowledge Discovered?
  Deductive Knowledge
  Inductive Knowledge

How is Knowledge Discovered?
Deductive Knowledge: New information (or facts) is deduced by applying pre-specified logical rules of deduction to the given data (e.g., deductive databases).
A Prolog program to build a simple family knowledge base:
    % Facts.
    sister(mary, jack).
    sister(mary, jim).
    brother(jack, mary).
    brother(jack, jim).
    father(john, jack).
    father(john, jim).
    % Rules. Variables start with an uppercase letter; ":-" reads as "if".
    sibling(X, Y) :- father(Z, X), father(Z, Y).
    sibling(X, Y) :- brother(X, Y).
    sibling(X, Y) :- brother(Z, X), brother(Z, Y).
    sibling(X, Y) :- sister(X, Y).
    sibling(X, Y) :- sister(Z, X), sister(Z, Y).

How is Knowledge Discovered?
Inductive Knowledge: Discovers new rules and patterns from the supplied data (e.g., data mining).
Inductive reasoning moves from specific observations to broader generalizations and theories. We begin with specific observations and measures, begin to detect patterns and regularities, formulate some tentative hypotheses that we can explore, and finally end up developing some general conclusions or theories.

Typical Results of Data Mining
1. Association Rules: (Whenever a customer buys video equipment, she also buys another electronic gadget.)
2. Sequential Patterns: (When a customer buys a camera and within three months buys photographic supplies, then within six months he is likely to buy an accessory item.) (If the customer buys more than twice in the lean periods, he may be more likely to buy at least once during the Christmas period.)

Typical Results of Data Mining
3. Classification Trees/Hierarchies: (Customers may be classified by frequency of visits, by type of financing used, by amount of purchase, or by affinity for types of items, and then some revealing statistics may be generated for such classes.) (Customers may be divided into five categories of creditworthiness, based on prior credit transactions.)
4. Patterns within time series: (Stocks of utility companies X, Y and Z showed the same pattern during 2003, in terms of closing stock price.) (The retail sales index improves in the months immediately following the tax refund/rebate period.) (Two products show the same sales pattern during summer but not winter.)

The Goals of Data Mining:
1. Prediction
2. Identification
3. Classification and Clustering
4. Optimization

The Goals of Data Mining:
1. Prediction: Predict future behavior (e.g., predicting that certain discount levels will cause specific customers to purchase an item; predicting sales in a given period; recognizing that certain seismic wave patterns may predict an earthquake).

The Goals of Data Mining:
2. Identification: Identifying the existence of an item, event or activity (e.g., system intruders may be identified by the type of programs being executed, files accessed, CPU utilization, network activities and the time at which such events occur).

The Goals of Data Mining:
3. Classification and Clustering: Partitioning the data into categories or classes (e.g., discount-seeking shopper, shopper in a rush, loyal and regular shopper, name-brand shopper, infrequent shopper, etc.).

Classification or Supervised Learning
An analyst for a telecommunications company wants to understand why some customers remain loyal while others leave. Ultimately, the analyst wants to predict which customers are most likely to leave and join competitors.
The analyst can construct a model derived from historical data of loyal and disloyal customers. Building a model for this business problem requires knowledge of which customers have remained loyal and which have not.
This type of mining is called classification or supervised learning, because the training examples are labeled with the actual class they belong to (loyal or lost).
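To make the role of the labels concrete, here is a minimal sketch of a supervised learner. It uses a 1-nearest-neighbor rule, which is only one of many possible classifiers and is not named in the text; the two features and every data point are hypothetical.

```python
# Hypothetical labeled history: (months_as_customer, support_calls) -> label.
training = [
    ((36, 1), "loyal"),
    ((48, 0), "loyal"),
    ((30, 2), "loyal"),
    ((6, 5), "lost"),
    ((3, 7), "lost"),
    ((9, 4), "lost"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(customer):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    _, label = min(training, key=lambda pair: squared_distance(pair[0], customer))
    return label

print(predict((40, 1)))  # loyal
print(predict((5, 6)))   # lost
```

The prediction is possible only because each training row carries its known outcome, which is exactly what distinguishes classification from clustering.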

Clustering or Unsupervised Learning
Retailers want to know where similarities exist in their customer base so that they can create and understand the different groups to which they sell and market.
The analyst will use a database with rows of customer information and attempt to create customer segments. The data set may contain many attributes, such as whether a customer has children, single-parent status, and income level. During the discovery process, differences in these attributes can be used to separate the data into natural groupings.
This approach is referred to as clustering or unsupervised learning. Clustering can be based on historical patterns, but unlike the classification approach, the outcome is not supplied with the training data.
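A small sketch of how such segments might be found without labels is given below. It uses plain k-means, chosen here only for illustration since the text does not name an algorithm; the customer attributes, the number of segments, and the starting centers are all assumptions.

```python
# Hypothetical customers: (annual_income_in_thousands, number_of_children).
# No outcome labels are supplied.
customers = [(25, 0), (30, 1), (28, 0), (90, 2), (95, 3), (85, 2)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

# Assumption: two segments, seeded with the first and last customer
# so that the run is reproducible.
clusters, centers = k_means(customers, centers=[customers[0], customers[-1]])
print(clusters)
```

On this toy data the two groups separate mainly by income; with real data the attributes would normally be rescaled first so that no single attribute dominates the distance.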

The Goals of Data Mining:
4. Optimization: Optimize the use of limited resources (e.g., time, money, space, material, personnel, etc.).

In the real world such results can be used to:
  Plan store locations based on demographics
  Run targeted promotions
  Combine items in advertising
  Predict which admission criteria will lead to academic success, better retention, and higher graduation rates

Association Rules and Frequent Item-sets
The market-basket problem assumes we have some large number of items, e.g., "bread", "milk", etc.
Customers fill their market baskets with some subset of the items. We get to know what items people buy together, even if we don't know who they are.
Marketers use this information to position items and to control the way a typical customer traverses the store.

Association Rules and Frequent Item-sets
In addition to the marketing application, the same sort of question has the following uses:
1. Baskets = documents; items = words. Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.
2. Baskets = sentences; items = documents. Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.
3. Baskets = semester schedules; items = courses. Courses appearing together in students' schedules may have a synergistic effect for current or future semester schedules.
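The first of these mappings (baskets = documents, items = words) can be sketched in a few lines. The three documents below are made up, and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter
from itertools import combinations

# Hypothetical "baskets": each document is reduced to its set of distinct words.
documents = [
    "data mining finds patterns",
    "data mining needs clean data",
    "clean patterns are rare",
]

pair_counts = Counter()
for text in documents:
    words = sorted(set(text.split()))           # a basket is a set, so duplicates are dropped
    pair_counts.update(combinations(words, 2))  # every unordered word pair in the basket

print(pair_counts[("data", "mining")])     # 2
print(pair_counts[("clean", "patterns")])  # 1
```

Pairs with a high count, such as ("data", "mining") here, are the candidates for phrases or linked concepts.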

Goals for Market-Basket Mining
1. Association rules are statements of the form {X1, X2, ..., Xn} => Y, meaning that if we find all of X1, X2, ..., Xn in the market basket, then we have a good chance of finding Y.
The probability of finding Y given that X1, X2, ..., Xn are all in the basket is called the confidence of the rule.
We normally search only for rules whose confidence is above a certain threshold (and significantly higher than it would be if items were placed into baskets at random).

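Goals for Market-Basket Mining
Example-1: (Uninformative rule) We might find the rule {milk, butter} => bread simply because a lot of people buy bread. Its confidence is little better than the fraction of all baskets that contain bread.
Consider the following examples, which hold for the same reason:
  {shoe polish} => bread
  {wine} => bread
  {flowers} => bread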

Goals for Market-Basket Mining
Example-2: (High confidence) {diapers} => beer
The beer/diapers story asserts that the rule {diapers} => beer holds with confidence significantly greater than the fraction of baskets that contain beer.
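The comparison between a rule's confidence and the plain fraction of baskets containing the right-hand item can be sketched as follows. The baskets are invented for illustration, and the ratio computed at the end (often called lift) is a common way to express the comparison rather than something defined on the slide.

```python
# Hypothetical baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"bread", "milk"},
    {"bread"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"bread", "shoe polish"},
]

def fraction_with(items, among):
    """Fraction of the given baskets that contain every item in `items`."""
    return sum(1 for b in among if items <= b) / len(among)

def confidence(lhs, rhs):
    """Fraction of baskets containing `lhs` that also contain `rhs`."""
    having_lhs = [b for b in baskets if lhs <= b]
    return fraction_with(rhs, having_lhs)

beer_baseline = fraction_with({"beer"}, baskets)       # 3 of 8 baskets
diaper_beer = confidence({"diapers"}, {"beer"})        # 2 of 3 diaper baskets
bread_baseline = fraction_with({"bread"}, baskets)     # 7 of 8 baskets
polish_bread = confidence({"shoe polish"}, {"bread"})  # 1 of 1 shoe-polish baskets

print(round(diaper_beer / beer_baseline, 2))    # 1.78
print(round(polish_bread / bread_baseline, 2))  # 1.14
```

In this toy data {shoe polish} => bread has perfect confidence yet barely beats the bread baseline, while {diapers} => beer is well above the beer baseline, which is the distinction the two examples draw.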

Causality:
Ideally, we would like to know that in an association rule the presence of X1, ..., Xn actually "causes" Y to be bought. However, "causality" is an elusive concept. Nevertheless, for market-basket data, the following test suggests what causality means.
If we lower the price of diapers and raise the price of beer, we can lure diaper buyers, who are more likely to pick up beer while in the store, thus covering our losses on the diapers. That strategy works because "diapers causes beer."
However, working it the other way round, running a sale on beer and raising the price of diapers, will not result in beer buyers buying diapers in any great numbers, and we lose money.

Frequent Item-sets:
In many (but not all) situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets. For example, we cannot run a good marketing strategy involving items that no one buys anyway.
Thus, much data mining starts with the assumption that we only care about sets of items with high support; i.e., they appear together in many baskets.
We then find association rules or causalities only involving a high-support set of items (i.e., X1, ..., Xn and Y must appear together in at least a certain percentage of the baskets, called the support threshold).
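A minimal sketch of filtering item-sets by a support threshold is shown below. It simply counts every item-set of a given size in every basket, which is a brute-force approach suitable only for tiny data and not a specific published algorithm; the baskets and the 40% threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]
SUPPORT_THRESHOLD = 0.4  # assumed: an item-set must appear in at least 40% of baskets

def frequent_itemsets(size):
    """Return {item-set: support} for item-sets of `size` items that meet the threshold."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), size))
    return {
        frozenset(items): n / len(baskets)
        for items, n in counts.items()
        if n / len(baskets) >= SUPPORT_THRESHOLD
    }

frequent_pairs = frequent_itemsets(2)
for items, s in frequent_pairs.items():
    print(sorted(items), s)
```

Here {bread, milk} has support 0.6 and {eggs, milk} and {bread, butter} have support 0.4, while pairs seen in only one basket are discarded before any rules are formed.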

Implementing Association Rules:
An association rule is of the form X => Y, where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym} are sets of items, with x_i and y_j being distinct items for all i and all j.
X => Y states that if a customer buys X, then she is likely to buy Y.
In general, LHS => RHS, where LHS and RHS are sets of items. The set LHS U RHS is called an item-set (e.g., a set of items purchased by customers).
For an association rule to be considered interesting, the rule must satisfy the Support and Confidence measures.

Support or Prevalence for Association Rule:
Support for the rule LHS => RHS refers to how frequently a specific item-set occurs in the database: the percentage of transactions that contain all the items in the item-set (LHS U RHS).
If the support is low, the item-set occurs in only a small fraction of transactions, and therefore the association rule is not as reliable.

Confidence or Strength for Association Rule:
Confidence for the rule LHS => RHS refers to how strong the association is.
Confidence is calculated as: Support(LHS U RHS) / Support(LHS)
In other words, it is the probability that the items in RHS will be purchased, given that the items in LHS are purchased.

Example of Association Rules:
T-ID   Time   Items Bought
101    6:35   milk, bread, cookies, juice
792    7:38   milk, juice
1130   8:05   milk, eggs
1735   8:45   bread, cookies, coffee
Suppose the following association rules have been observed: milk => juice and bread => juice

Example of Association Rules:
T-ID   Time   Items Bought
101    6:35   milk, bread, cookies, juice
792    7:38   milk, juice
1130   8:05   milk, eggs
1735   8:45   bread, cookies, coffee
What is the support for {milk, juice}? 50% (2 of 4 transactions)
What is the support for {bread, juice}? 25% (1 of 4 transactions)

Example of Association Rules:
T-ID   Time   Items Bought
101    6:35   milk, bread, cookies, juice
792    7:38   milk, juice
1130   8:05   milk, eggs
1735   8:45   bread, cookies, coffee
What is the confidence for milk => juice? 50% / 75% = 66.7%
What is the confidence for bread => juice? 25% / 50% = 50%
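The figures in this example can be reproduced with a short script. The sketch below encodes the four transactions from the table and applies the definitions Support = fraction of transactions containing the item-set and Confidence = Support(LHS U RHS) / Support(LHS); the function names are illustrative.

```python
# The four transactions from the table, keyed by T-ID.
transactions = {
    101: {"milk", "bread", "cookies", "juice"},
    792: {"milk", "juice"},
    1130: {"milk", "eggs"},
    1735: {"bread", "cookies", "coffee"},
}

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for bought in transactions.values() if items <= bought)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Support(LHS U RHS) / Support(LHS)."""
    return support(lhs | rhs) / support(lhs)

print(support({"milk", "juice"}))                 # 0.5
print(support({"bread", "juice"}))                # 0.25
print(round(confidence({"milk"}, {"juice"}), 3))  # 0.667
print(confidence({"bread"}, {"juice"}))           # 0.5
```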

What is a good Association Rule?
The goal of mining association rules is to generate all rules that exceed some user-specified minimum support and confidence thresholds.
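One way to picture this goal is an exhaustive search over item-sets and their splits into a left-hand and right-hand side. The sketch below does that for the four example transactions (milk, bread, cookies, juice, eggs, coffee); the two threshold values are assumptions, and brute-force enumeration like this is workable only for very small data.

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
MIN_SUPPORT = 0.5     # assumed user-specified threshold
MIN_CONFIDENCE = 0.6  # assumed user-specified threshold

def support(items):
    return sum(1 for t in transactions if items <= t) / len(transactions)

def mine_rules():
    """Enumerate every item-set, keep the frequent ones, and test each LHS/RHS split."""
    all_items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(all_items) + 1):
        for itemset in map(frozenset, combinations(all_items, size)):
            s = support(itemset)
            if s < MIN_SUPPORT:
                continue
            # Try every non-empty proper subset as the left-hand side.
            for k in range(1, size):
                for lhs in map(frozenset, combinations(sorted(itemset), k)):
                    conf = s / support(lhs)
                    if conf >= MIN_CONFIDENCE:
                        rules.append((sorted(lhs), sorted(itemset - lhs), s, conf))
    return rules

rules = mine_rules()
for lhs, rhs, s, conf in rules:
    print(lhs, "=>", rhs, f"support={s:.2f}", f"confidence={conf:.2f}")
```

With these thresholds the search keeps milk => juice (support 0.50, confidence 0.67) but drops bread => juice, whose support of 0.25 falls below the minimum.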

Data mining algorithms
At the heart of data mining is the process of building a model to represent a data set.
Vendors and researchers often discuss the differences between models built using different algorithms and approaches.
There are hundreds of derivative approaches under generic data mining model names such as neural networks, agent networks, decision trees, concept hierarchies, genetic algorithms, fuzzy logic, and belief networks.
For example, NeuralWare offers a neural network product set that includes over 25 different neural network approaches.

How does "Data Mining" compare with other statistical techniques?
Data analysis has existed for decades, and the combination of computers and statistics has accelerated the manipulation of very large data sets for discovering knowledge.
Statistical approaches to data analysis often involve regression analysis, which has long been used to model data. Many regression models assume that the underlying data is well-behaved and that the relationships are of a form that can be linearly transformed for ease of estimation.
This restricts the modeler, because in the real world many relationships do not follow a predictable linear function.
The newer "machine learning" or data mining techniques impose no such prior restraints on the model and can seek out relationships that would otherwise go undetected by traditional methods.

Why should you consider using "Data Mining"?
Data mining automates the process of discovering useful trends and patterns. It can be designed to automate the process of learning about evolving relationships with the aid of an expert, the model builder.
When dealing with large databases, data mining is a computationally intensive process and requires a fair amount of disk space as well.
Decreases in hardware costs have made data mining available to a much wider audience. The increase in the power of PCs and the decrease in their cost have made data mining feasible for all types of businesses, large and small.