Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Similar documents
What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Knowledge Discovery and Data Mining

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Frequent Pattern Mining

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.

D B M G Data Base and Data Mining Group of Politecnico di Torino

Data Mining Course Overview

Data mining fundamentals

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

9. Conclusions. 9.1 Definition KDD

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Jarek Szlichta

Jeffrey D. Ullman Stanford University

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Database and Knowledge-Base Systems: Data Mining. Martin Ester

DATA MINING II - 1DL460

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

Knowledge Discovery in Data Bases

COMP90049 Knowledge Technologies

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Data warehouse and Data Mining

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

DATA MINING II - 1DL460

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Data mining overview. Data Mining. Data mining overview. Data mining overview. Data mining overview. Data mining overview 3/24/2014

Nesnelerin İnternetinde Veri Analizi

Data Mining Clustering

Data Mining Concepts

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Data Mining Algorithms

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Introduction to Data Mining and Data Analytics

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

We will be releasing HW1 today It is due in 2 weeks (1/25 at 23:59pm) The homework is long

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Knowledge Discovery & Data Mining

Association Rule Mining. Entscheidungsunterstützungssysteme

Chapter 4 Data Mining A Short Introduction

Introduction to Data Mining

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

ISM 50 - Business Information Systems

Big Data Analytics CSCI 4030

Association Rules. CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University. Slides adapted from lectures by Jeff Ullman

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

COMP 465 Special Topics: Data Mining

Outline. Project Update Data Mining: Answers without Queries. Principles of Information and Database Management 198:336 Week 12 Apr 25 Matthew Stone

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

Data Mining Concepts & Tasks

1. Inroduction to Data Mininig

Foundation of Data Mining: Introduction

BCB 713 Module Spring 2011

Data Mining Concepts & Tasks

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

CISC 4631 Data Mining Lecture 01:

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Warehousing. Data Mining

COMPARISON OF K-MEAN ALGORITHM & APRIORI ALGORITHM AN ANALYSIS

On-Line Application Processing

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

Supervised and Unsupervised Learning (II)

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

Association Pattern Mining. Lijun Zhang

INTRODUCTION TO DATA MINING

Defining a Data Mining Task. CSE3212 Data Mining. What to be mined? Or the Approaches. Task-relevant Data. Estimation.

Stats Overview Ji Zhu, Michigan Statistics 1. Overview. Ji Zhu 445C West Hall

PATTERN DISCOVERY IN TIME-ORIENTED DATA

Introduction to Data Mining CS 584 Data Mining (Fall 2016)

An Improved Apriori Algorithm for Association Rules

Association Rules. Berlin Chen References:

Association Rules Apriori Algorithm

Interestingness Measurements

Machine Learning: Symbolische Ansätze

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Chapter 4 Data Mining A Short Introduction. 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1

CS570 Introduction to Data Mining

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Chapter 7: Frequent Itemsets and Association Rules

Basic Data Mining Technique

COMS 4721: Machine Learning for Data Science Lecture 23, 4/20/2017

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Classification by Association

Chapter 1 Introduction to Data Mining

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

CSE 626: Data mining. Instructor: Sargur N. Srihari. Phone: , ext. 113

Association Rules Apriori Algorithm

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

Chapter 3: Data Mining:

ASSOCIATION RULE MINING: MARKET BASKET ANALYSIS OF A GROCERY STORE

Decision Support Systems

DATA MINING LECTURE 1. Introduction

An Effectual Approach to Swelling the Selling Methodology in Market Basket Analysis using FP Growth

Winter Semester 2009/10 Free University of Bozen, Bolzano

Transcription:

Data Mining and Information Retrieval Introduction to Data Mining

Why Data Mining? Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a huge amount of data, how to analyze and understand the data? CMPT 454: Database Systems II Introduction to Data Mining 2/ 26

Example: Data Collection Each customer has a club member card Transactions: customers purchases of commodities {bread, milk, cheese} if they are bought together Each transaction is associated with a customer, a store, time, total price Analytic question: what are product combinations that are frequently purchased together by customers? CMPT 454: Database Systems II Introduction to Data Mining 3/ 26

Example: Data Pre-processing Data selection Customer ids, stores are not interesting Only the transactions are selected Data integration Integrate data from all stores Partition data by time, e.g., a data set per week Data cleaning Normalize data sets for long weekends Estimate effects of promotion campaigns CMPT 454: Database Systems II Introduction to Data Mining 4/ 26

Example: Mining Identify proper data mining methods Only the hot product combinations? Changes of hot product combinations over time? Only combinations having expensive items? Select appropriate algorithms Centralized vs. distributed databases? Mining in batch or online? Incremental mining? CMPT 454: Database Systems II Introduction to Data Mining 5/ 26

Example: Evaluation Find a pattern {digital camera, memory stick, image processing software} Explanation Customers buying digital camera often purchase memory stick and image processing software at the same time Actions Cluster digital it cameras, memory sticks and image processing software together in stores Promote oememory oystick to oattract ac more oedg digital camera ea purchases An interesting result: beer and baby diaper CMPT 454: Database Systems II Introduction to Data Mining 6/ 26

What is Data Mining? Mining data mining knowledge Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96] CMPT 454: Database Systems II Introduction to Data Mining 7/ 26

The KDD Process CMPT 454: Database Systems II Introduction to Data Mining 8/ 26

What Kinds of Data to Be Mined? Any data that are useful in practice Some examples Relational databases Transactional databases Spatial data Time-series Semi-structured data & WWW Streaming data Bio-medical data Network traffic data Sensor network surveillance data CMPT 454: Database Systems II Introduction to Data Mining 9/ 26

What Kinds of Patterns? Any meaningful complicated patterns that are not directly query-able from data Some examples Association rules and sequential patterns Classification Clusters and outliers CMPT 454: Database Systems II Introduction to Data Mining 10 / 26

Identify Interesting Patterns There can be a huge number of patterns Find all product combinations purchased by more than 1% of customers {bread, milk} is trivial, common sense {diamond, pearl necklace} is informative Various users may be interested in different patterns Find the patterns strongly wanted by users CMPT 454: Database Systems II Introduction to Data Mining 11 / 26

Association Rules The Market-Basket Model A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day. CMPT 454: Database Systems II Introduction to Data Mining 12 / 26

Support Simplest question: find sets of items that appear frequently in the baskets. Support for itemset I = the number of baskets containing all items in I. Given a support threshold s, sets of items that appear in > s baskets are called frequent itemsets. CMPT 454: Database Systems II Introduction to Data Mining 13 / 26

Example Items={milk, coke, pepsi, beer, juice}. Support = 3 baskets. B 1 ={m {m, c, b} B 2 ={m {m, p, j} B 3 = {m, b} B 4 = {c, j} B 5 = {m, p, b} B 6 = {m, c, b, j} B 7 = {c, b, j} B 8 = {b, c} Frequent itemsets: {m}, },{c},{b},{j}, {m, b}, {c, b}, {j, c}. CMPT 454: Database Systems II Introduction to Data Mining 14 / 26

Application (1) Real market baskets: chain stores keep terabytes of information about what customers buy together. Tells how typical customers navigate stores, lets them position tempting items. Suggests tie-in tricks, e.g., run sale on diapers and raise the price of beer. High support needed, or no $$ s. CMPT 454: Database Systems II Introduction to Data Mining 15 / 26

Application (2) Baskets = documents; items = words in those documents. Lets us find words that appear together unusually frequently, i.e., linked concepts. Baskets = sentences, items = documents containing those sentences. Items that appear together too often could represent plagiarism. CMPT 454: Database Systems II Introduction to Data Mining 16 / 26

Association Rules If-then rules about the contents of baskets. {i 1, i 2,,i k } j means: if a basket contains all of i 1,,ii k then it is likely to contain j. Confidence of this association rule is the probability of j given i 1,,i k. CMPT 454: Database Systems II Introduction to Data Mining 17 / 26

Example B 1 = {m, c, b} B 2 = {m, p, j} B 3 = {m, b} B 4 = {c, j} B 5 ={m {m, p, b} B 6 ={m {m, c, b, j} B 7 = {c, b, j} B 8 = {b, c} 7 8 An association rule: {m, b} c. Confidence = 2/4 = 50%. CMPT 454: Database Systems II Introduction to Data Mining 18 / 26

Finding Association Rules A typical question: find all association rules with support s and confidence c. Note: support of an association rule is the support of the set of items it mentions. Hard part: finding the high-support (frequent ) itemsets. Checking the confidence of association rules involving those sets is relatively easy. CMPT 454: Database Systems II Introduction to Data Mining 19 / 26

Classification Learning from Examples Given a set of credit card frauds, can we detect frauds in the future? Examples: credit card frauds Goal: case predictions Many applications Fraud detection, intrusion detection, automatic credit approval, customer relationship management, spam detection, virus detection, CMPT 454: Database Systems II Introduction to Data Mining 20 / 26

Classification: A 2-step Process Model construction: describe/summarize a set of predetermined d classes Training dataset: tuples for model construction Each tuple/sample belongs to a predefined class Model: classification rules, decision trees, or math formulae Model application: classify unseen objects Estimate accuracy of the model using an independent test set Acceptable accuracy -> apply the model to classify tuples with unknown class labels CMPT 454: Database Systems II Introduction to Data Mining 21 / 26

Model Construction CMPT 454: Database Systems II Introduction to Data Mining 22 / 26

Model Application CMPT 454: Database Systems II Introduction to Data Mining 23 / 26

An Example: Decision Tree CMPT 454: Database Systems II Introduction to Data Mining 24 / 26

Clustering: Applications Customer segmentation How to partition customers into groups so that customers in each group are similar, while customers in different groups are dissimilar? Pattern recognition in image How to identify objects in a satellite image? The pixels of an object are similar il to each other in some way CMPT 454: Database Systems II Introduction to Data Mining 25 / 26

What is Clustering? CMPT 454: Database Systems II Introduction to Data Mining 26 / 26