Data Mining Concepts

Similar documents
Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Supervised and Unsupervised Learning (II)

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Chapter 4: Mining Frequent Patterns, Associations and Correlations

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

An Introduction to Data Mining

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Chapter 1, Introduction

Data Mining Clustering

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Table Of Contents: xix Foreword to Second Edition

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Association Rules. Berlin Chen References:

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Chapter 4 Data Mining A Short Introduction

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Data Mining and Analytics. Introduction

2. Discovery of Association Rules

Data Mining Course Overview

Mining Association Rules in Large Databases

Outline. Project Update Data Mining: Answers without Queries. Principles of Information and Database Management 198:336 Week 12 Apr 25 Matthew Stone

Basic Data Mining Technique

Jarek Szlichta

Contents. Preface to the Second Edition

Association Pattern Mining. Lijun Zhang

Comparison of FP tree and Apriori Algorithm

Introduction to Data Mining and Data Analytics

Data warehouse and Data Mining

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Association Rules Outline

Optimization using Ant Colony Algorithm

Chapter 3: Data Mining:

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

An Effectual Approach to Swelling the Selling Methodology in Market Basket Analysis using FP Growth

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

Improved Frequent Pattern Mining Algorithm with Indexing

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

An Improved Apriori Algorithm for Association Rules

DATA MINING Introductory and Advanced Topics Part I

Production rule is an important element in the expert system. By interview with

COMP 465 Special Topics: Data Mining

Data Mining and Data Warehousing Introduction to Data Mining

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Association Rules Apriori Algorithm

Machine Learning: Symbolische Ansätze

Optimization of Association Rule Mining Using Genetic Algorithm

Understanding Rule Behavior through Apriori Algorithm over Social Network Data

Introduction to Data Mining

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Product presentations can be more intelligently planned

Data warehouses Decision support The multidimensional model OLAP queries

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

BCB 713 Module Spring 2011

Frequent Pattern Mining

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Knowledge Discovery and Data Mining

Interestingness Measurements

KNOWLEDGE DISCOVERY AND DATA MINING

Association Rules Apriori Algorithm

2 CONTENTS

SCHEME OF COURSE WORK. Data Warehousing and Data mining

The k-means Algorithm and Genetic Algorithm

Enhanced Entity-Relationship (EER) Modeling

Interestingness Measurements

Data Mining. Vera Goebel. Department of Informatics, University of Oslo

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Oracle9i Data Mining. Data Sheet August 2002

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

Study on the Application Analysis and Future Development of Data Mining Technology

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records.

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Slide 16-1

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

5. MULTIPLE LEVELS AND CROSS LEVELS ASSOCIATION RULES UNDER CONSTRAINTS

Association Rule Discovery

COMPARISON OF K-MEAN ALGORITHM & APRIORI ALGORITHM AN ANALYSIS

Temporal Weighted Association Rule Mining for Classification

M.Kannan et al IJCSET Feb 2011 Vol 1, Issue 1,30-34

Data Access Paths for Frequent Itemsets Discovery

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

The Transpose Technique to Reduce Number of Transactions of Apriori Algorithm

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining

Clustering Analysis based on Data Mining Applications Xuedong Fan

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

Parallelism in Knowledge Discovery Techniques

Association Rule Mining. Entscheidungsunterstützungssysteme

Data mining overview. Data Mining. Data mining overview. Data mining overview. Data mining overview. Data mining overview 3/24/2014

A Parallel Evolutionary Algorithm for Discovery of Decision Rules

Associating Terms with Text Categories

Gurpreet Kaur 1, Naveen Aggarwal 2 1,2

Nesnelerin İnternetinde Veri Analizi

Chapter 14. Database Design Theory: Introduction to Normalization Using Functional and Multivalued Dependencies

Transcription:

Data Mining Concepts

Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential pattern analysis Time Series Analysis Regression Neural Networks Genetic Algorithms Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-2

Definitions of Data Mining The discovery of new information in terms of patterns or rules from vast amounts of data. The process of finding interesting structure in data. The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-3

Data Warehousing The data warehouse is a historical database designed for decision support. Data mining can be applied to the data in a warehouse to help with certain types of decisions. Proper construction of a data warehouse is fundamental to the successful use of data mining. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-4

Knowledge Discovery in Databases (KDD) Data mining is actually one step of a larger process known as knowledge discovery in databases (KDD). The KDD process model comprises six phases Data selection Data cleansing Enrichment Data transformation or encoding Data mining Reporting and displaying discovered knowledge Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-5

Goals of Data Mining and Knowledge Discovery (PICO) Prediction: Determine how certain attributes will behave in the future. Identification: Identify the existence of an item, event, or activity. Classification: Partition data into classes or categories. Optimization: Optimize the use of limited resources. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-6

Types of Discovered Knowledge Association Rules Classification Hierarchies Sequential Patterns Patterns Within Time Series Clustering Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-7

Association Rules Association rules are frequently used to generate rules from market-basket data. A market basket corresponds to the sets of items a consumer purchases during one visit to a supermarket. The set of items purchased by customers is known as an itemset. An association rule is of the form X=>Y, where X ={x 1, x 2,., x n }, and Y = {y 1,y 2,., y n } are sets of items, with x i and y i being distinct items for all i and all j. For an association rule to be of interest, it must satisfy a minimum support and confidence. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-8

Association Rules Confidence and Support Support: The minimum percentage of instances in the database that contain all items listed in a given association rule. Support is the percentage of transactions that contain all of the items in the itemset, LHS U RHS. Confidence: Given a rule of the form A=>B, rule confidence is the conditional probability that B is true when A is known to be true. Confidence can be computed as support(lhs U RHS) / support(lhs) Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-9

Generating Association Rules The general algorithm for generating association rules is a two-step process. Generate all itemsets that have a support exceeding the given threshold. Itemsets with this property are called large or frequent itemsets. Generate rules for each itemset as follows: For itemset X and Y a subset of X, let Z = X Y; If support(x)/support(z) > minimum confidence, the rule Z=>Y is a valid rule. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-10

Generating Association Rules: The Apriori Algorithm The Apriori algorithm was the first algorithm used to generate association rules. The Apriori algorithm uses the general algorithm for creating association rules together with downward closure and anti-monotonicity. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-11

Generating Association Rules: The Sampling Algorithm The sampling algorithm selects samples from the database of transactions that individually fit into memory. Frequent itemsets are then formed for each sample. If the frequent itemsets form a superset of the frequent itemsets for the entire database, then the real frequent itemsets can be obtained by scanning the remainder of the database. In some rare cases, a second scan of the database is required to find all frequent itemsets. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-12

Generating Association Rules: Frequent-Pattern Tree Algorithm The Frequent-Pattern Tree Algorithm reduces the total number of candidate itemsets by producing a compressed version of the database in terms of an FP-tree. The FP-tree stores relevant information and allows for the efficient discovery of frequent itemsets. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-13

Generating Association Rules: The Partition Algorithm Divide the database into non-overlapping subsets. Treat each subset as a separate database where each subset fits entirely into main memory. Apply the Apriori algorithm to each partition. Take the union of all frequent itemsets from each partition. These itemsets form the global candidate frequent itemsets for the entire database. Verify the global set of itemsets by having their actual support measured for the entire database. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-14

Complications seen with Association Rules The cardinality of itemsets in most situations is extremely large. Association rule mining is more difficult when transactions show variability in factors such as geographic location and seasons. Item classifications exist along multiple dimensions. Data quality is variable; data may be missing, erroneous, conflicting, as well as redundant. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-15

Classification Classification is the process of learning a model that is able to describe different classes of data. Learning is supervised as the classes to be learned are predetermined. Learning is accomplished by using a training set of pre-classified data. The model produced is usually in the form of a decision tree or a set of rules. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-16

Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-17

An Example Rule Here is one of the rules extracted from the decision tree of Figure 28.7. IF 50K > salary >= 20K AND age >=25 THEN class is yes Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-18

Clustering Unsupervised learning or clustering builds models from data without predefined classes. The goal is to place records into groups where the records in a group are highly similar to each other and dissimilar to records in other groups. The k-means algorithm is a simple yet effective clustering technique. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-19

Additional Data Mining Methods Sequential pattern analysis Time Series Analysis Regression Neural Networks Genetic Algorithms Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-20

Sequential Pattern Analysis Transactions ordered by time of purchase form a sequence of itemsets. The problem is to find all subsequences from a given set of sequences that have a minimum support. The sequence S 1, S 2, S 3,.. is a predictor of the fact that a customer purchasing itemset S 1 is likely to buy S 2, and then S 3, and so on. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-21

Time Series Analysis Time series are sequences of events. For example, the closing price of a stock is an event that occurs each day of the week. Time series analysis can be used to identify the price trends of a stock or mutual fund. Time series analysis is an extended functionality of temporal data management. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-22

Regression Analysis A regression equation estimates a dependent variable using a set of independent variables and a set of constants. The independent variables as well as the dependent variable are numeric. A regression equation can be written in the form Y=f(x 1,x 2,,x n ) where Y is the dependent variable. If f is linear in the domain variables x i, the equation is call a linear regression equation. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-23

Neural Networks A neural network is a set of interconnected nodes designed to imitate the functioning of the brain. Node connections have weights which are modified during the learning process. Neural networks can be used for supervised learning and unsupervised clustering. The output of a neural network is quantitative and not easily understood. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-24

Genetic Learning Genetic learning is based on the theory of evolution. An initial population of several candidate solutions is provided to the learning model. A fitness function defines which solutions survive from one generation to the next. Crossover, mutation and selection are used to create new population elements. Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-25

Data Mining Applications Marketing Marketing strategies and consumer behavior Finance Fraud detection, creditworthiness and investment analysis Manufacturing Resource optimization Health Image analysis, side effects of drug, and treatment effectiveness Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-26

Recap Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential pattern analysis Time Series Analysis Regression Neural Networks Genetic Algorithms Copyright 2011 Ramez Elmasri and Shamkant B. Navathe Slide 28-27