Data Mining: Answers without Queries
Principles of Information and Database Management 198:336
Week 12, Apr 25
Matthew Stone


Outline
  Project update
  Data Mining: Answers without Queries
    Patterns and statistics
    Finding frequent item sets
    Classification and regression trees

Project Update
  One week left
  My office hours tomorrow 4-6
  Yangzhe's office hours 7-9
  Wednesday: consultation with Vlad in lieu of recitation

Project Update
  Hand in Monday by 6. Email to mds.
    URL of working system
    Zip/Tar file of code
    Suggested tour
  Make sure you have something working by Wednesday!

Project Update
  Useful tool: sessions

    HttpSession s = request.getSession();
    Object o = s.getAttribute("attribute");
    s.setAttribute("another", o);

  (Sessions will get lost when the server restarts.)

Data Mining
  SQL is about answering specific questions
  What if you don't know the question to ask?
    What's interesting about this data?
    What's going on here?
    What happens a lot?
  Data mining!

Limits of Data Mining
  Randomness
    Some things just happen for no reason
    In large data sets, you may see this a lot

Limits of Data Mining
  Sparse data
    Beware of breaking up data
    The amount of data available decreases exponentially in the number of constraints

Human in the loop
  Selecting data to explore
  Cleaning data
    Minimizing noise, outliers, discrepancies in format
    Organizing data into new and better tables
  Evaluating results
    Understanding what's happening
    Explaining it to the boss

Finding frequent item sets
  Problem of associations
    What items go together in a table
  Example: market basket
    What items tended to be bought together?

Finding frequent item sets
  Sample table:

    transact  item      transact  item
    111       pen       113       pen
    111       ink       113       milk
    111       milk      114       pen
    111       juice     114       ink
    112       pen       114       juice
    112       ink       114       water
    112       milk

  Factoids
    In 75% of transactions a pen and ink are purchased together
    In 25% of transactions milk and juice are purchased together
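
To make the factoids concrete, here is a minimal sketch in plain Java (the SupportDemo class and the in-memory basket layout are illustrative, not part of the course code) that recomputes both percentages from the sample table:

    import java.util.*;

    // Minimal sketch: compute the support of an itemset over the sample transactions.
    public class SupportDemo {
        public static void main(String[] args) {
            Map<Integer, Set<String>> baskets = new LinkedHashMap<>();
            baskets.put(111, Set.of("pen", "ink", "milk", "juice"));
            baskets.put(112, Set.of("pen", "ink", "milk"));
            baskets.put(113, Set.of("pen", "milk"));
            baskets.put(114, Set.of("pen", "ink", "juice", "water"));

            System.out.println(support(baskets, Set.of("pen", "ink")));    // 0.75
            System.out.println(support(baskets, Set.of("milk", "juice"))); // 0.25
        }

        // Fraction of transactions that contain all the items in the itemset.
        static double support(Map<Integer, Set<String>> baskets, Set<String> itemset) {
            long hits = baskets.values().stream().filter(b -> b.containsAll(itemset)).count();
            return (double) hits / baskets.size();
        }
    }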

Definition
  Itemset: a set of items
  Support: the fraction of transactions that contain all the items in the itemset
  Frequent itemsets: all itemsets whose support exceeds some threshold

Example
  Frequent itemsets at 70%:
    {pen}, {ink}, {milk}, {pen,ink}, {pen,milk}

Efficient Algorithm
  Key property
    Every subset of a frequent itemset is also a frequent itemset

Algorithm step 1
  Identify the frequent itemsets with one item:

    SELECT item
    FROM table
    GROUP BY item
    HAVING COUNT(*) > threshold

Algorithm step 2
  Iteratively try to build larger frequent itemsets out of the ones you've found already

Algorithm step 2
  For each new frequent itemset Ik with k items:
    Generate all candidate itemsets I(k+1) with k+1 items
    Scan all the transactions once
    Check if the new itemsets are frequent
    Set k = k+1

Algorithm step 3
  Stop when no new frequent itemsets are identified

Finding frequent item sets
  Sample table (as shown above)

Mining for association rules
  Association rules LHS → RHS
    LHS is a set of items
    RHS is a set of items
  Example: {pen} → {ink}
    If a pen is purchased in a transaction, it is likely that ink is also purchased in that transaction.

Measures
  Support
    Percentage of transactions that contain LHS ∪ RHS
  Confidence
    Percentage of LHS transactions that also contain RHS
    = Support of (LHS ∪ RHS) / Support of LHS

Finding them
  First, find frequent itemsets
  Create possible rules from frequent itemsets
  Keep those with high confidence
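
A minimal sketch of the level-wise search above, again in plain Java over in-memory baskets rather than SQL (class and variable names are illustrative); with the 70% threshold it recovers exactly the frequent itemsets listed in the earlier example:

    import java.util.*;

    // Level-wise search for frequent itemsets (illustrative sketch; in-memory, not SQL).
    public class FrequentItemsets {
        public static void main(String[] args) {
            List<Set<String>> baskets = List.of(
                Set.of("pen", "ink", "milk", "juice"),    // xact 111
                Set.of("pen", "ink", "milk"),             // xact 112
                Set.of("pen", "milk"),                    // xact 113
                Set.of("pen", "ink", "juice", "water"));  // xact 114
            double minSupport = 0.70;

            Set<String> items = new TreeSet<>();
            baskets.forEach(items::addAll);

            // Step 1: frequent itemsets with one item.
            Set<Set<String>> current = new HashSet<>();
            for (String item : items)
                if (support(baskets, Set.of(item)) >= minSupport) current.add(Set.of(item));

            // Step 2: grow each frequent k-itemset by one item, keep the frequent candidates.
            // Step 3: stop when no new frequent itemsets are identified.
            Set<Set<String>> all = new HashSet<>(current);
            while (!current.isEmpty()) {
                Set<Set<String>> next = new HashSet<>();
                for (Set<String> itemset : current)
                    for (String item : items) {
                        if (itemset.contains(item)) continue;
                        Set<String> candidate = new HashSet<>(itemset);
                        candidate.add(item);
                        if (support(baskets, candidate) >= minSupport) next.add(candidate);
                    }
                all.addAll(next);
                current = next;
            }
            all.forEach(System.out::println);  // {pen}, {ink}, {milk}, {pen, ink}, {pen, milk} (in some order)
        }

        // Fraction of baskets containing every item of the itemset.
        static double support(List<Set<String>> baskets, Set<String> itemset) {
            return (double) baskets.stream().filter(b -> b.containsAll(itemset)).count() / baskets.size();
        }
    }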

Example
  {pen, milk}
    Support is 75%
  {pen} → {milk}
    Confidence is 75%
  {milk} → {pen}
    Confidence is 100%

Statistical Perspective
  Is L → R surprising?
  Statistical independence
    Support tells us P(L & R)
    Compare it with P(L) * P(R)
    Not so interesting if P(L & R) = P(L) * P(R)

Correlation and prediction
  Want L → R to be associated with causality
  Basic idea of causality:
    Even if we intervene to change how the value of L is determined,
    we still get the same correlation with R.

Correlation and Prediction
  For example, with {pen} → {ink}:
    If we change why people buy pens, we still want them to buy ink too.
    For example, we can lower the price of pens.
  Problem
    Things can go together for other reasons

CART
  Classification and regression trees
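
Continuing the same in-memory sketch (names are again illustrative), confidence and the independence check can be computed directly from support:

    import java.util.*;

    // Confidence and the independence check for a rule LHS -> RHS (illustrative sketch).
    public class RuleMeasures {
        static List<Set<String>> baskets = List.of(
            Set.of("pen", "ink", "milk", "juice"),
            Set.of("pen", "ink", "milk"),
            Set.of("pen", "milk"),
            Set.of("pen", "ink", "juice", "water"));

        public static void main(String[] args) {
            report(Set.of("pen"), Set.of("milk"));   // confidence 0.75
            report(Set.of("milk"), Set.of("pen"));   // confidence 1.00
        }

        static void report(Set<String> lhs, Set<String> rhs) {
            Set<String> both = new HashSet<>(lhs);
            both.addAll(rhs);
            double pBoth = support(both), pL = support(lhs), pR = support(rhs);
            // A rule is uninteresting when P(L & R) is close to P(L) * P(R).
            System.out.printf("%s -> %s  confidence=%.2f  P(L&R)=%.2f  P(L)P(R)=%.2f%n",
                    lhs, rhs, pBoth / pL, pBoth, pL * pR);
        }

        static double support(Set<String> itemset) {
            return (double) baskets.stream().filter(b -> b.containsAll(itemset)).count() / baskets.size();
        }
    }

On this toy table, {pen} → {milk} has confidence 75% but P(L & R) equals P(L) * P(R) exactly, because pen appears in every transaction; by the independence test the rule is not surprising, and {milk} → {pen} has confidence 100% for the same uninteresting reason.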

Tree structured rules
  Node either makes a prediction
    E.g., classify into a particular class
  Or looks at a variable/field
    Tests its value
    Discrete fields: test if equal to a specific value
    Numerical fields: test if > threshold
  Recurse

Example
  Insurance info relation:

    age  cartype  highrisk
    23   sedan    false
    30   sports   false
    36   sedan    false
    25   truck    true
    30   sedan    false
    23   truck    true
    30   truck    false
    25   sports   true
    18   sedan    false

Example - visualization
  Classification tree (first split on Age, second split on Car Type):

    Age <= 25?
      No  (> 25)   -> predict No (not high risk)
      Yes (<= 25)  -> Car Type = Sedan?
                        Yes -> predict No
                        No  -> predict Yes (high risk)
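
Written out as code, the tree is just nested tests. This illustrative Java fragment (the class and field names simply mirror the InsuranceInfo relation) classifies every row of the sample relation correctly:

    // The classification tree above, written out as nested tests (illustrative only).
    public class InsuranceTree {
        // Returns true if the tree predicts high risk.
        static boolean highRisk(int age, String cartype) {
            if (age > 25) return false;        // first split: Age <= 25?
            return !cartype.equals("sedan");   // second split: Car Type = Sedan?
        }

        public static void main(String[] args) {
            int[] ages =        {23, 30, 36, 25, 30, 23, 30, 25, 18};
            String[] cartypes = {"sedan", "sports", "sedan", "truck", "sedan",
                                 "truck", "truck", "sports", "sedan"};
            for (int i = 0; i < ages.length; i++)
                System.out.println(ages[i] + " " + cartypes[i] + " -> "
                        + highRisk(ages[i], cartypes[i]));
        }
    }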

Top-down greedy algorithm

  BuildTree(data D)
    Find the best split of D into D1 and D2
    BuildTree(D1)
    BuildTree(D2)
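
One way to flesh this out, as an illustrative sketch only: the Java below picks the split that minimizes weighted Gini impurity (the slides leave the purity measure open; Gini is one common choice in CART) and stops when a node is pure, a base case the slide leaves implicit. On the insurance relation it reproduces the tree shown earlier.

    import java.util.*;
    import java.util.function.Predicate;

    // Illustrative top-down greedy BuildTree: choose the split with the lowest
    // weighted Gini impurity, then recurse on the two halves.
    public class BuildTreeSketch {
        record Row(int age, String cartype, boolean highrisk) {}

        static void buildTree(List<Row> d, String indent) {
            long pos = d.stream().filter(Row::highrisk).count();
            if (pos == 0 || pos == d.size()) {           // base case: node is pure
                System.out.println(indent + "predict highrisk = " + (pos > 0));
                return;
            }
            Predicate<Row> best = null; double bestScore = Double.MAX_VALUE; String bestName = "";
            for (Row r : d) {                            // numeric candidates: age <= t
                int t = r.age();
                Predicate<Row> p = x -> x.age() <= t;
                double s = score(d, p);
                if (s < bestScore) { bestScore = s; best = p; bestName = "age <= " + t; }
            }
            for (Row r : d) {                            // categorical candidates: cartype = v
                String v = r.cartype();
                Predicate<Row> p = x -> x.cartype().equals(v);
                double s = score(d, p);
                if (s < bestScore) { bestScore = s; best = p; bestName = "cartype = " + v; }
            }
            System.out.println(indent + "split on " + bestName);
            Predicate<Row> split = best;
            buildTree(d.stream().filter(split).toList(), indent + "  ");          // D1
            buildTree(d.stream().filter(split.negate()).toList(), indent + "  "); // D2
        }

        // Weighted Gini impurity of the two halves induced by a candidate split.
        static double score(List<Row> d, Predicate<Row> split) {
            List<Row> d1 = d.stream().filter(split).toList();
            List<Row> d2 = d.stream().filter(split.negate()).toList();
            if (d1.isEmpty() || d2.isEmpty()) return Double.MAX_VALUE;
            return (d1.size() * gini(d1) + d2.size() * gini(d2)) / d.size();
        }

        static double gini(List<Row> d) {
            double p = (double) d.stream().filter(Row::highrisk).count() / d.size();
            return 2 * p * (1 - p);
        }

        public static void main(String[] args) {
            buildTree(List.of(
                new Row(23, "sedan", false), new Row(30, "sports", false),
                new Row(36, "sedan", false), new Row(25, "truck", true),
                new Row(30, "sedan", false), new Row(23, "truck", true),
                new Row(30, "truck", false), new Row(25, "sports", true),
                new Row(18, "sedan", false)), "");
        }
    }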

Supporting this with SQL
  Attribute-Value Class sets (AVCs):

    SELECT attribute, class, COUNT(*)
    FROM table
    GROUP BY attribute, class

Supporting this with SQL
  For example:

    SELECT age, highrisk, COUNT(*)
    FROM InsuranceInfo
    GROUP BY age, highrisk

Supporting this with SQL
  Gives a slice through the database space:

    Age   True  False
    18    0     1
    23    1     1
    25    2     0
    30    0     3
    36    0     1

Supporting this with SQL
  Enough to find candidate split values
  And to determine how pure each set is
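
As a small illustration (the AVC counts are hard-coded here rather than read from the database), the age slice above already contains everything needed to score every candidate threshold age <= t:

    import java.util.*;

    // Score candidate splits "age <= t" directly from the AVC counts for age.
    public class AvcSplit {
        public static void main(String[] args) {
            // age -> {count of highrisk = true, count of highrisk = false}, from the AVC query.
            SortedMap<Integer, int[]> avc = new TreeMap<>();
            avc.put(18, new int[]{0, 1});
            avc.put(23, new int[]{1, 1});
            avc.put(25, new int[]{2, 0});
            avc.put(30, new int[]{0, 3});
            avc.put(36, new int[]{0, 1});

            int totalT = avc.values().stream().mapToInt(c -> c[0]).sum();
            int totalF = avc.values().stream().mapToInt(c -> c[1]).sum();

            int leftT = 0, leftF = 0;
            for (int t : avc.keySet()) {                 // each distinct age is a candidate threshold
                leftT += avc.get(t)[0];
                leftF += avc.get(t)[1];
                int rightT = totalT - leftT, rightF = totalF - leftF;
                if (rightT + rightF == 0) break;         // empty right side: not a real split
                double score = (gini(leftT, leftF) * (leftT + leftF)
                              + gini(rightT, rightF) * (rightT + rightF)) / (totalT + totalF);
                System.out.printf("age <= %d  weighted Gini = %.3f%n", t, score);
            }
        }

        static double gini(int t, int f) {
            double p = (double) t / (t + f);
            return 2 * p * (1 - p);
        }
    }

The lowest score comes out at age <= 25, the first split in the example tree.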

Algorithm with SQL support

  BuildTree(data D)
    Scan the data and construct the AVC group
    Use the AVC group to split into D1 and D2
    BuildTree(D1)
    BuildTree(D2)

Sparse data

Overfitting

CART and Statistics
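
A sketch of how the scan step might be wired up with JDBC (the connection URL is a placeholder and error handling is omitted), building the age AVC group with the GROUP BY query from the earlier slide:

    import java.sql.*;
    import java.util.*;

    // Sketch: build the AVC group for the age attribute with one SQL scan (JDBC).
    public class AvcScan {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:postgresql://localhost/insurance";   // placeholder connection URL
            SortedMap<Integer, int[]> avc = new TreeMap<>();        // age -> {true count, false count}

            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT age, highrisk, COUNT(*) FROM InsuranceInfo GROUP BY age, highrisk")) {
                while (rs.next()) {
                    int age = rs.getInt(1);
                    boolean high = rs.getBoolean(2);
                    int count = rs.getInt(3);
                    int[] cell = avc.computeIfAbsent(age, a -> new int[2]);
                    cell[high ? 0 : 1] += count;
                }
            }
            // avc can now be handed to the split-scoring code from the previous sketch.
            avc.forEach((age, c) -> System.out.println(age + ": true=" + c[0] + " false=" + c[1]));
        }
    }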