An Introduction to Data Mining

Hossein Hakimzadeh
Computer and Information Sciences
Data Mining (B561)

What Is Data Mining?

Original definition: "data mining" was a statistician's term for overusing data to draw invalid inferences.

Bonferroni's theorem: if there are too many possible conclusions to draw, some will be true for purely statistical reasons, with no physical validity.

What Is Data Mining?

Joseph Rhine, a "parapsychologist" at Duke in the 1950s, tested students for "extrasensory perception" (ESP) by asking them to guess 10 cards as red or black. He found that about 1 in 1,000 of them guessed all 10 correctly, and he declared them to have ESP. When he retested them, he found they did no better than average. His conclusion: telling people they have ESP causes them to lose it!
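The "about 1 in 1,000" figure in the ESP anecdote is what pure guessing would produce, which is the point of Bonferroni's warning. The short Python sketch below works through that arithmetic. The function names and the subject count of 1,000 are assumptions chosen for illustration; the slides do not say how many students were tested.

```python
from fractions import Fraction

def chance_of_perfect_run(n_cards: int) -> Fraction:
    """Probability of guessing n_cards red/black cards correctly by luck alone."""
    return Fraction(1, 2) ** n_cards

def expected_lucky_subjects(n_subjects: int, n_cards: int) -> Fraction:
    """Expected number of subjects who get a perfect run purely by chance."""
    return n_subjects * chance_of_perfect_run(n_cards)

p = chance_of_perfect_run(10)
print(p)                                         # 1/1024, i.e. about 1 in 1,000
print(float(expected_lucky_subjects(1000, 10)))  # 0.9765625 (hypothetical 1,000 subjects)
```

On a retest the same 1/1024 chance applies to each "gifted" subject, so finding that they do no better than average is exactly what guessing predicts.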

What Is Data Mining?

Definition-1: "Discovery of useful summaries of data." (Data Mining Course at Stanford University, http://www-db.stanford.edu/~ullman/mining/mining.html)

Definition-2: The mining or discovery of new information in terms of patterns or rules from vast amounts of data. (Fundamentals of Database Systems, Elmasri and Navathe, 4th Edition, Addison Wesley.)

Data Mining vs. Data Retrieval

Existing query tools can be likened to a flashlight used to locate interesting information in data. The user is left to point the flashlight wherever he or she expects to find useful trends and patterns.

Data mining discovers patterns that direct the user toward the right questions to ask with traditional query tools. A data mining tool does not require any assumptions; it tries to discover relationships and hidden patterns that may not always be obvious.

Applications of Data Mining

Some examples of "successes":

1. Decision trees constructed from bank-loan histories to produce algorithms that decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
3. "Diapers and beer." Customers who buy diapers are more likely to buy beer than average customers. This observation allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips in between increased the sales of all three items.

Applications of Data Mining

Some examples of "successes":

4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
5. Comparison of the genotype of people with/without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining will become much more important as the human genome is constructed.

(Data Mining Course at Stanford University, http://www-db.stanford.edu/~ullman/mining/mining.html)

Data-Mining Communities

Data mining has been claimed by a number of research communities:

- Statistics.
- Artificial Intelligence, where it is called "machine learning." Neural networks and genetic algorithms are also used.
- Researchers in clustering algorithms.
- Visualization researchers.
- Databases, where data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.

(Data Mining Course at Stanford University, http://www-db.stanford.edu/~ullman/mining/mining.html)

Stages of the Data-Mining Process

1. Data gathering
2. Data cleansing
3. Feature extraction
4. Pattern extraction and discovery
5. Visualization of the data
6. Evaluation of results

Stages of the Data-Mining Process

1. Data gathering: data warehousing, Web crawling.
2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
3. Feature extraction: obtain only the interesting attributes of the data and remove useless ones, e.g., "date acquired" is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery: this is the stage that is often thought of as "data mining".
5. Visualization of the data.
6. Evaluation of results: not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
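The cleansing and feature-extraction stages can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the record layout, the field names, and the plausible temperature range are hypothetical, and real pipelines need domain-specific rules.

```python
# Hypothetical patient records; the field names are assumptions for this sketch.
records = [
    {"patient": "A", "fever": 98.6, "date_acquired": "2004-01-03"},
    {"patient": "B", "fever": 125.0, "date_acquired": "2004-01-04"},  # bogus reading
    {"patient": "C", "fever": 101.2, "date_acquired": "2004-01-05"},
]

PLAUSIBLE_FEVER = (90.0, 110.0)          # assumed bounds, degrees Fahrenheit
USEFUL_ATTRIBUTES = ("patient", "fever")  # assumed to be the interesting attributes

def cleanse(rows):
    """Data cleansing: drop rows whose reading falls outside the plausible range."""
    low, high = PLAUSIBLE_FEVER
    return [r for r in rows if low <= r["fever"] <= high]

def extract_features(rows):
    """Feature extraction: keep only the attributes considered useful."""
    return [{key: r[key] for key in USEFUL_ATTRIBUTES} for r in rows]

clean = extract_features(cleanse(records))
print(clean)
```

Patient B's reading of 125 is discarded during cleansing, and the "date_acquired" attribute is removed during feature extraction.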

How is Knowledge Discovered?

- Deductive Knowledge
- Inductive Knowledge

How is Knowledge Discovered?

Deductive Knowledge: New information (or facts) is deduced by applying pre-specified logical rules of deduction to the given data (e.g., deductive databases).

A Prolog program to build a simple family knowledge base:

% Facts
sister(mary, jack).
sister(mary, jim).
brother(jack, mary).
brother(jack, jim).
father(john, jack).
father(john, jim).

% Rules. Prolog variables must start with an uppercase letter.
% X \= Y keeps a person from being derived as his or her own sibling.
sibling(X, Y) :- father(Z, X), father(Z, Y), X \= Y.
sibling(X, Y) :- brother(X, Y).
sibling(X, Y) :- brother(Z, X), brother(Z, Y), X \= Y.
sibling(X, Y) :- sister(X, Y).
sibling(X, Y) :- sister(Z, X), sister(Z, Y), X \= Y.

How is Knowledge Discovered?

Inductive Knowledge: Discovers new rules and patterns from the supplied data (i.e., data mining).

Inductive reasoning works by moving from specific observations to broader generalizations and theories. We begin with specific observations and measures, begin to detect patterns and regularities, formulate some tentative hypotheses that we can explore, and finally end up developing some general conclusions or theories.

Typical Results of Data Mining

1. Association Rules: whenever a customer buys video equipment, she also buys another electronic gadget.
2. Sequential Patterns: when a customer buys a camera and, within three months, buys photographic supplies, then within six months he is likely to buy an accessory item. Likewise, if the customer buys more than twice in the lean periods, he may be more likely to buy at least once during the Christmas period.

Typical Results of Data Mining

3. Classification Trees/Hierarchies: customers may be classified by frequency of visits, by type of financing used, by amount of purchase, or by affinity for types of items, and then some revealing statistics may be generated for such classes. For example, customers may be divided into five categories of creditworthiness, based on prior credit transactions.
4. Patterns within time series:
   - Stocks of utility companies X, Y and Z showed the same pattern during 2003, in terms of closing stock price.
   - The retail sales index improves in the months immediately following the tax refund/rebate period.
   - Two products show the same sales pattern during summer but not winter.

The Goals of Data Mining

1. Prediction
2. Identification
3. Classification and Clustering
4. Optimization

The Goals of Data Mining

1. Prediction: predict future behavior. Examples:
   - predicting that certain discount levels will cause specific customers to purchase an item;
   - predicting sales in a given period;
   - certain seismic wave patterns may predict an earthquake.

The Goals of Data Mining

2. Identification: identifying the existence of an item, event or activity. For example, system intruders may be identified by the type of programs being executed, files accessed, CPU utilization, network activities and the time at which such events occur.

The Goals of Data Mining

3. Classification and Clustering: partitioning the data into categories or classes (e.g., discount-seeking shopper, shopper in a rush, loyal and regular shopper, name-brand shopper, infrequent shopper, etc.).

Classification or Supervised Learning

An analyst for a telecommunications company wants to understand why some customers remain loyal while others leave. Ultimately, the analyst wants to predict which customers are most likely to leave and join competitors. The analyst can construct a model derived from historical data of loyal and disloyal customers. Building a model for this business problem requires knowledge of which customers have remained loyal and which have not.

This type of mining is called classification or supervised learning, because the training examples are labeled with the actual class they belong to (loyal or lost).
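As a minimal sketch of supervised learning, the Python below fits a one-level decision tree (a "decision stump") to labeled customer history. The single feature (monthly complaints), the data, and the function names are all hypothetical; the slides do not prescribe a particular classifier, and a real churn model would use many attributes.

```python
# Hypothetical labeled history: (monthly_complaints, actual_class).
training = [
    (0, "loyal"), (1, "loyal"), (1, "loyal"), (2, "loyal"),
    (3, "lost"), (4, "lost"), (5, "lost"),
]

def predict(threshold, complaints):
    """Decision rule of the stump: more complaints than the threshold means 'lost'."""
    return "lost" if complaints > threshold else "loyal"

def fit_stump(examples):
    """Choose the threshold that misclassifies the fewest labeled examples."""
    best_threshold, best_errors = None, len(examples) + 1
    for candidate in sorted({x for x, _ in examples}):
        errors = sum(predict(candidate, x) != label for x, label in examples)
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

threshold = fit_stump(training)
print(threshold)              # 2
print(predict(threshold, 4))  # lost
print(predict(threshold, 1))  # loyal
```

The labels in `training` are what make this supervised: the threshold is chosen by comparing predictions against known outcomes.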

Clustering or Unsupervised Learning

Retailers want to know where similarities exist in their customer base so that they can create and understand the different groups to which they sell and market. The analyst will use a database with rows of customer information and attempt to create customer segments. The data set may contain many attributes, such as whether customers have children, whether they are single parents, and their income level. During the discovery process, differences in these attributes can be used to separate the data into natural groupings.

This approach is referred to as clustering or unsupervised learning. Clustering can be based on historical patterns, but unlike the classification approach, the outcome is not supplied with the training data.
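To make the contrast with classification concrete, here is a small Python sketch of one common clustering method, k-means, applied to a single attribute. The choice of k-means, the income figures, and the starting centers are assumptions for illustration; the slides do not name a specific clustering algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one numeric attribute; returns (centers, clusters)."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

incomes = [21, 23, 25, 58, 60, 62]  # hypothetical incomes, in thousands
centers, clusters = kmeans_1d(incomes, centers=[20.0, 30.0])
print(centers)   # [23.0, 60.0]
print(clusters)  # [[21, 23, 25], [58, 60, 62]]
```

No labels are supplied: the two customer segments emerge only from the differences in the attribute values.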

The Goals of Data Mining

4. Optimization: optimize the use of limited resources (e.g., time, money, space, material, personnel, etc.).

In the real world such results can be used to:

- Plan store locations based on demographics.
- Run targeted promotions.
- Combine items in advertising.
- Predict what admission criteria will lead to academic success, better retention, and graduation rates.

Association Rules and Frequent Item-sets

The market-basket problem assumes we have some large number of items, e.g., "bread", "milk", etc. Customers fill their market baskets with some subset of the items. We get to know what items people buy together, even if we don't know who they are.

Marketers use this information to position items and to control the way a typical customer traverses the store.

Association Rules and Frequent Item-sets

In addition to the marketing application, the same sort of question has the following uses:

1. Baskets = documents; items = words. Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.
2. Baskets = sentences; items = documents. Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.
3. Baskets = semester schedules; items = courses. Courses appearing together in students' schedules may have a synergistic effect for current or future semester schedules.

Goals for Market-Basket Mining

1. Association rules are statements of the form {X1, X2, ..., Xn} → Y, meaning that if we find all of X1, X2, ..., Xn in the market basket, then we have a good chance of finding Y. The probability of finding Y in a basket that contains all of X1, X2, ..., Xn is called the confidence of the rule. We normally search only for rules whose confidence is above a certain threshold and significantly higher than it would be if items were placed at random into baskets.

Goals for Market-Basket Mining

Example-1: (Confidence no higher than chance) We might find the rule {milk, butter} → bread simply because a lot of people buy bread. The same is true of rules such as:

{shoe polish} → bread
{wine} → bread
{flower} → bread

Goals for Market-Basket Mining

Example-2: (High confidence) {diapers} → beer

The beer/diapers story asserts that the rule {diapers} → beer holds with confidence significantly greater than the fraction of baskets that contain beer.
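The comparison between a rule's confidence and the overall fraction of baskets containing the right-hand item can be sketched in Python as follows. The ten baskets are invented toy data, and the ratio computed here is commonly called "lift", a name the slides themselves do not use.

```python
from fractions import Fraction

# Toy baskets invented for illustration only.
baskets = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "butter", "bread"},
    {"milk", "butter", "bread"},
    {"milk", "butter", "bread"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"bread"},
]

def fraction_with(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return Fraction(sum(items <= b for b in baskets), len(baskets))

def confidence(lhs, rhs, baskets):
    """Fraction of the baskets containing `lhs` that also contain `rhs`."""
    return fraction_with(set(lhs) | {rhs}, baskets) / fraction_with(lhs, baskets)

def lift(lhs, rhs, baskets):
    """Confidence divided by the baseline fraction of baskets containing `rhs`."""
    return confidence(lhs, rhs, baskets) / fraction_with({rhs}, baskets)

print(confidence({"milk", "butter"}, "bread", baskets), lift({"milk", "butter"}, "bread", baskets))  # 4/5 1
print(confidence({"diapers"}, "beer", baskets), lift({"diapers"}, "beer", baskets))                  # 3/4 5/2
```

In this toy data both rules have high confidence, but only {diapers} → beer beats its baseline: bread is in 80% of all baskets anyway, so {milk, butter} → bread tells us nothing new.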

Causality

Ideally, we would like to know that in an association rule the presence of X1, ..., Xn actually "causes" Y to be bought. However, "causality" is an elusive concept. Nevertheless, for market-basket data, the following test suggests what causality means.

If we lower the price of diapers and raise the price of beer, we can lure diaper buyers, who are more likely to pick up beer while in the store, thus covering our losses on the diapers. That strategy works because "diapers causes beer." However, working it the other way round, running a sale on beer and raising the price of diapers, will not result in beer buyers buying diapers in any great numbers, and we lose money.

Frequent Item-sets

In many (but not all) situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets. For example, we cannot run a good marketing strategy involving items that no one buys anyway.

Thus, much data mining starts with the assumption that we only care about sets of items with high support; i.e., they appear together in many baskets. We then find association rules or causalities only involving a high-support set of items; i.e., {X1, ..., Xn, Y} must appear together in at least a certain percentage of the baskets, called the support threshold.

Implementing Association Rules

An association rule is of the form X → Y, where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym} are sets of items, with xi and yj being distinct items for all i and all j. X → Y states that if a customer buys X, then she is likely to buy Y.

In general, a rule has the form LHS → RHS, where LHS and RHS are sets of items. The set LHS ∪ RHS is called an item-set (e.g., a set of items purchased by customers).

For an association rule to be considered interesting, it must satisfy the support and confidence measures.

Support or Prevalence for Association Rule

Support for the rule LHS → RHS refers to how frequently a specific item-set occurs in the database: it is the percentage of transactions that contain all the items in the item-set (LHS ∪ RHS).

If the support is low, the item-set occurs in only a small fraction of transactions, and therefore the association rule is not as reliable.

Confidence or Strength for Association Rule

Confidence for the rule LHS → RHS refers to how strong the association is. Confidence is calculated as:

Confidence(LHS → RHS) = Support(LHS ∪ RHS) / Support(LHS)

In other words, it is the probability that the items in RHS will be purchased, given that the items in LHS are purchased.

Example of Association Rules

T-ID   Time   Items Bought
101    6:35   milk, bread, cookies, juice
792    7:38   milk, juice
1130   8:05   milk, eggs
1735   8:45   bread, cookies, coffee

Suppose the following association rules have been observed: milk → juice and bread → juice.

Example of Association Rules

For the four transactions above:

What is the support for {milk, juice}? 50% (2 of 4 transactions)
What is the support for {bread, juice}? 25% (1 of 4 transactions)

Example of Association Rules

For the same four transactions:

What is the confidence for milk → juice? 50% / 75% = 66.7%
What is the confidence for bread → juice? 25% / 50% = 50%
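The support and confidence figures for the example transactions can be checked with a short Python sketch. The data is the four-transaction table from the example; the function names `support` and `confidence` are chosen here for illustration. Exact fractions are used so that the results match the slide percentages without rounding.

```python
from fractions import Fraction

# The four transactions from the example, keyed by T-ID.
transactions = {
    101: {"milk", "bread", "cookies", "juice"},
    792: {"milk", "juice"},
    1130: {"milk", "eggs"},
    1735: {"bread", "cookies", "coffee"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(itemset <= items for items in transactions.values())
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs):
    """Support(LHS ∪ RHS) / Support(LHS)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"milk", "juice"}))        # 1/2  (50%)
print(support({"bread", "juice"}))       # 1/4  (25%)
print(confidence({"milk"}, {"juice"}))   # 2/3  (66.7%)
print(confidence({"bread"}, {"juice"}))  # 1/2  (50%)
```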

What is a good Association Rule?

The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence thresholds.
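One way to read this goal literally is a brute-force search: enumerate every item-set, keep those that meet the support threshold, and split each into LHS → RHS rules that meet the confidence threshold. The Python sketch below does that for the four example transactions. The function name `mine_rules` and the threshold values are assumptions, and this exhaustive approach is only workable for tiny data; practical algorithms such as Apriori prune the search.

```python
from itertools import combinations

# The four transactions from the association-rule example.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def mine_rules(min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for every rule meeting both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset)
            if s < min_support:
                continue  # the item-set is not frequent enough
            # Try every non-empty proper subset of the item-set as the LHS.
            for k in range(1, size):
                for lhs_items in combinations(combo, k):
                    lhs = frozenset(lhs_items)
                    conf = s / support(lhs)
                    if conf >= min_confidence:
                        rhs = tuple(sorted(itemset - lhs))
                        rules.append((lhs_items, rhs, s, conf))
    return rules

for lhs, rhs, s, conf in mine_rules(min_support=0.5, min_confidence=0.6):
    print(lhs, "->", rhs, f"support={s:.2f}", f"confidence={conf:.2f}")
```

With these thresholds the search keeps bread → cookies, cookies → bread, juice → milk, and milk → juice; raising the confidence threshold to 0.7 drops milk → juice, whose confidence is 66.7%.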

Data mining algorithms

At the heart of data mining is the process of building a model to represent a data set. Vendors and researchers often discuss the differences between models built using different algorithms and approaches.

There are hundreds of derivative approaches under generic data mining model names such as neural networks, agent networks, decision trees, concept hierarchies, genetic algorithms, fuzzy logic, and belief networks. For example, Neural Ware offers a neural network product set with over 25 different neural network approaches.

How does "Data Mining" compare with other statistical techniques?

Data analysis has been in existence for decades, and the advent of computers and statistics accelerated the manipulation of very large data sets for discovering knowledge.

Statistical approaches to data analysis involve a process called regression analysis, which has been used to model data. Various regression models rely on the assumptions that the underlying data is well-behaved and that the relationship structures are of a form that can be linearly transformed for ease of estimation. This restricts the modeler, because in the real world things do not function according to some predictable linear function.

The new "machine learning" or data mining techniques impose no such prior restraints on the model and can seek out relationships that would otherwise go undetected by traditional methods.

Why should you consider using "Data Mining"?

Data mining automates the process of discovering useful trends and patterns. It can be designed to automate the process of learning about evolving relationships with the aid of an expert, the model builder.

When dealing with large databases, data mining is a computationally intensive process and requires a fair amount of disk space as well. Decreases in hardware costs have made data mining available to a much wider audience. The increase in the power of PCs and the decrease in their cost have made data mining feasible for all types of businesses, large and small.