KNOWLEDGE DISCOVERY AND DATA MINING

Similar documents
Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Data Mining Course Overview

COMP90049 Knowledge Technologies

Data Mining Concepts

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

Knowledge Discovery in Data Bases

Data mining overview. Data Mining. Data mining overview. Data mining overview. Data mining overview. Data mining overview 3/24/2014

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management

1. Inroduction to Data Mininig

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

TIM 50 - Business Information Systems

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Knowledge Discovery and Data Mining

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

TIM 50 - Business Information Systems

Introduction to Data Mining and Data Analytics

Defining a Data Mining Task. CSE3212 Data Mining. What to be mined? Or the Approaches. Task-relevant Data. Estimation.

Interestingness Measurements

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Data warehouses Decision support The multidimensional model OLAP queries

Data mining fundamentals

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Foundation of Data Mining: Introduction

Knowledge Discovery and Data Mining

Data warehouse and Data Mining

CSE4334/5334 DATA MINING

Classification. Instructor: Wei Ding

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Chapter 3: Data Mining:

Chapter 2 BACKGROUND OF WEB MINING

Data Mining Concepts & Techniques

Outline. Project Update Data Mining: Answers without Queries. Principles of Information and Database Management 198:336 Week 12 Apr 25 Matthew Stone

D B M G Data Base and Data Mining Group of Politecnico di Torino

DATA MINING TRANSACTION

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

Yunfeng Zhang 1, Huan Wang 2, Jie Zhu 1 1 Computer Science & Engineering Department, North China Institute of Aerospace

SCHEME OF COURSE WORK. Data Warehousing and Data mining

Part I. Instructor: Wei Ding

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Data Mining An Overview ITEV, F /18

R07. FirstRanker. 7. a) What is text mining? Describe about basic measures for text retrieval. b) Briefly describe document cluster analysis.

Chapter 3: Supervised Learning

Oracle9i Data Mining. Data Sheet August 2002

Chapter 4 Data Mining A Short Introduction

Preprocessing DWML, /33

TECNOLOGIES FOR INFORMATION SYSTEMS

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

COMP 465 Special Topics: Data Mining

Visualization and text mining of patent and non-patent data

Introduction to Data Mining

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Data Mining & Machine Learning F2.4DN1/F2.9DM1

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

DATA WAREHOUSING AND MINING UNIT-V TWO MARK QUESTIONS WITH ANSWERS

An Introduction to Data Mining

Lecture 18. Business Intelligence and Data Warehousing. 1:M Normalization. M:M Normalization 11/1/2017. Topics Covered

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining Clustering

Enterprise Informatization LECTURE

Data Preprocessing. Slides by: Shree Jaswal

DATA WAREHOUING UNIT I

Data Preprocessing. Data Preprocessing

Data mining techniques for actuaries: an overview

Grid Computing Systems: A Survey and Taxonomy

Customer Clustering using RFM analysis

Association Rule Mining and Clustering

Lectures for the course: Data Warehousing and Data Mining (IT 60107)

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Question Bank. 4) It is the source of information later delivered to data marts.

Binary Association Rule Mining Using Bayesian Network

Value Added Association Rules

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Lesson 3: Building a Market Basket Scenario (Intermediate Data Mining Tutorial)

Table Of Contents: xix Foreword to Second Edition

Data Mining Classification - Part 1 -

CS Machine Learning

Chapter 1, Introduction

Classification and Regression

A SURVEY OF IMAGE MINING TECHNIQUES AND APPLICATIONS

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Data Warehousing and Machine Learning

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

Hierarchical Online Mining for Associative Rules

Dta Mining and Data Warehousing

Data Mining Concept. References. Why Mine Data? Commercial Viewpoint. Why Mine Data? Scientific Viewpoint

Knowledge Modelling and Management. Part B (9)

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Flexible Pattern Management within PSYCHO

UNIT 2 Data Preprocessing

Database Management Systems MIT Lesson 01 - Introduction By S. Sabraz Nawaz

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Transcription:

KNOWLEDGE DISCOVERY AND DATA MINING Prof. Fabio A. Schreiber Dipartimento di Elettronica e Informazione Politecnico di Milano

INFORMATION MANAGEMENT TECHNOLOGIES DATA WAREHOUSE DECISION SUPPORT SYSTEMS DATA MINING INFORMATION SYSTEMS ANALYSIS DATA INTEGRATION DISTRIBUTED ETHEROGENEOUS DATA MANAGEMENT WEB INFORMATION SYSTEMS REAL-TIME MAIN MEMORY TEMPORAL DATABASES NON STRUCTURED SEMISTRUCTURED AND MULTIMEDIAL INFORMATION EMBEDDED SISTEMS MOBILE AND CONTEXT- AWARE COMPONENTS INFORMATION RETRIEVAL SISTEMS Fabio A. Schreiber data mining 1

KNOWLEDGE DISCOVERY EVERY HUMAN KNOWLEDGE STARTS FROM INTUITIONS, PROCEEDS THROUGH CONCEPTS, AND REACHES ITS CLIMAX WITH IDEAS I. Kant Fabio A. Schreiber data mining 2

KNOWLEDGE HIERARCHY VARIABLES INVOICES SALES TREND MARKET RULES STRATEGIC DECISIONS ELEMENTS (VOLUME) DATA INFORMATION KNOWLEDGE WISDOM VALUE ADDED STATISTICAL PROCESSING KNOWLEDGE DISCOVERY PROCEDURES EXPERIENCE Fabio A. Schreiber data mining 3

KNOWLEDGE DISCOVERY AND DATA MINING KNOWLEDGE DISCOVERY IN DATABASES AND DATA WAREHOUSES TO IDENTIFY THE MOST SIGNIFICANT INFORMATION TO SHOW IT TO THE USER IN THE MOST CONVENIENT WAY DATA MINING ALGORITHM APPLICATION TO RAW DATA IN ORDER TO EXTRACT KNOWLEDGE (RELATIONS, PATHS, ) PREDICTIVE AIM (SIGNAL ANALYSIS, VOICE RECOGNITION, ECC.) DESCRIPTIVE AIM (DECISION SUPPORT SYSTEMS) Fabio A. Schreiber data mining 4

KNOWLEDGE DISCOVERY PROCESS (1) EVEN IF SPECIALIZED TOOLS ARE AVAILABLE IT REQUIRES A COMPETENCE IN USED TECHNIQUES A VERY GOOD APPLICATION DOMAIN KNOWLEDGE SEQUENTIAL STEPS SELECTION CHOICE OF THE SAMPLE DATA THE ANALYSIS SHALL BE FOCUSED ON PREPROCESSING DATA SAMPLING IN ORDER TO REDUCE THEIR VOLUME DATA SCRUBBING FOR ERRORS AND OMISSIONS Fabio A. Schreiber data mining 5

KNOWLEDGE DISCOVERY PROCESS (2) TRANSFORMATION DATA TYPES HOMOGENEIZATION AND/OR CONVERSION DATA MINING CHOICE OF THE METHOD/ALGORITHM INTERPRETATION AND EVALUATION RETRIEVED INFORMATION FILTERING POSSIBLE REFINING BY PREVIOUS STEPS REPETITION SEARCH RESULTS VISUAL PRESENTATION (GRAPHICAL OR LOGICAL) Fabio A. Schreiber data mining 6

KNOWLEDGE DISCOVERY PROCESS (3) DATA MINING INTERPRETATION KNOWLEDGE TRANSFORMATION PRE-PROCESSING PRE- PROCESSED DATA TRANSFORMED DATA CORRELATIONS AND PATHS SELECTION TARGET DATA RAW DATA da: G. Piatesky-Shapiro 1996 Fabio A. Schreiber data mining 7

DATA MINING APPLICATIONS CORRESPONDENCE AND RETAIL SALES WHICH SPECIAL OFFER SHOULD BE MADE HOW TO LOCATE GOODS ON THE SHELVES MARKETING SALES FORECASTS PRODUCTS BUYING PATHS BANKS LOANS CONTROL CREDIT CARDS USAGE (AND MISUSE) TELECOMMUNICATIONS SPECIAL FARES Fabio A. Schreiber data mining 8

DATA MINING APPLICATIONS ASTRONOMY AND ASTROPHYSICS STARS AND GALAXIES CLASSIFICATION CHEMICAL AND PHARMACEUTIC RESEARCH NEW COMPOUNDS DISCOVERY INTERRELATIONS AMONG CHEMICAL COMPOUNDS MOLECULAR BIOLOGY PATTERNS IN GENETICAL DATA AND IN MOLECULAR STRUCTURES REMOTE SURVEY AND METEOROLOGY SATELLITE DATA ANALYSIS ECONOMIC AND DEMOGRAPHIC STATISTICS CENSUS DATA ANALYSIS Fabio A. Schreiber data mining 9

DATA MINING ALGORITHMS STRUCTURE MODEL REPRESENTATION FORMALISMS TO REPRESENT AND DESCRIBE POSSIBLE PATHS MODEL EVALUATION STATISTICAL OR LOGICAL ESTIMATE OF THE CORRESPONDENCE OF A PATH TO THE SEARCH CRITERIA SEARCH METHOD OF PARAMETERS SEARCH OF THE PARAMETERS WHICH OPTIMIZE THE EVALUATION CRITERIA, THE OBSERVATIONS SET AND THE MODEL REPRESENTATION BEING GIVEN OF MODEL THE PARAMETERS ARE APPLIED TO MODELS BELONGING TO THE SAME FAMILY, DIFFERENTIATED BY THE REPRESENTATION TYPE, FOR QUALITY EVALUATION Fabio A. Schreiber data mining 10

WHAT KIND OF INFORMATION DO WE GET? ASSOCIATIONS SET OF RULES SPECIFYING THE JOINT OCCURRENCE OF TWO (OR MORE) ELEMENTS SEQUENCES POSSIBILITY OF STATING TEMPORAL SEQUENCES OF EVENTS CLASSIFICATIONS GROUPING OF ELEMENTS INTO CLASSES FOLLOWING A GIVEN MODEL CLUSTERS GROUPING OF ELEMENTS INTO CLASSES WHICH HAVE NOT BEEN DEFINED A-PRIORI TRENDS DISCOVERY OF PECULIAR TEMPORAL PATHS HAVING A FORECASTING VALUE Fabio A. Schreiber data mining 11

DATA MINING TECHNIQUES AND MODELS NEURAL NETWORKS VERY POWERFUL TOOL LEARNING CAPABILITY OPAQUE BEHAVIOUR INDUCTION DECISION TREES (HIERARCHY OF if - then DECISIONS) RULE INDUCTION (NON HIERARCHICAL SETS DERIVED FROM DECISION TREES) SELF EXPLAINING Fabio A. Schreiber data mining 12

DATA MINING TECHNIQUES AND MODELS RULES DISCOVERY ASSOCIATION RULES (SIMULTANEOUS EVENTS X Y) SEQUENTIAL ASSOCIATIONS DATA VISUALIZATION DATA ARE PREPARED AND GRAPHICALLY PRESENTED IN ORDER TO EVIDENTIATE POSSIBLE IRREGULARITIES OR STRANGE PATTERNS Fabio A. Schreiber data mining 13

DATA MINING TECHNIQUES TECHNIQUES INFORMATION ASSOCIATIONS SEQUENCES CLASSIFICATIONS CLUSTERS TRENDS NEURAL NETWORKS INDUCTION RULES DISCOVERY DATA VISUALIZATION Fabio A. Schreiber data mining 14

THE MARKET BASKET THE BEST-KNOWN MODEL ON WHICH DATA MINING TECHNIQUES ARE APPLIED MAINLY, BUT NOT EXCLUSIVELY, USED FOR RETAIL SALE PROBLEMS THE GOAL IS TO DISCOVER RECURRENT PATTERNS IN DATA (ASSOCIATION RULES) Fabio A. Schreiber data mining 15

I = {i 1,..., i k } SET OF k ELEMENTS (ITEM) B = {b 1,..., b n }SET OF n SUBSETS (BASKET) OF I b i I THE MARKET BASKET I GOODS IN A SUPERMARKET WORDS IN A DICTIONARY B A CUSTOMER S PURCHASE A DOCUMENT IN A CORPUS ASSOCIATION RULE i 1 i 2 i 1 AND i 2 SHOW TOGETHER IN AT LEAST s% OF THE n BASKET (SUPPORT) OF ALL THE BASKETS CONTAINING i 1 AT LEAST c% CONTAIN ALSO i 2 (CONFIDENCE) Fabio A. Schreiber data mining 16

EXAMPLES OF RULES RULES HAVING DIET COKE AS CONSEQUENT HELP IN DEFINING THE STRATEGIES TO INCREASE THE SALES OF A SPECIFIC PRODUCT RULES HAVING GRISSINI AMONG THEIR ANTECEDENTS HELP IN UNDERSTANDING THE IMPACT OF CEASING A SPECIFIC PRODUCT SALE RULES HAVING WURSTEL AMONG THEIR ANTECEDENTS AND MUSTARD AS CONSEQUENT EVIDENTIATE THE PRODUCT COUPLINGS IN THE ANTECEDENT INDUCING THE SALE OF THE CONSEQUENT Fabio A. Schreiber data mining 17

EXAMPLES OF RULES RULES CONNECTING ELEMENTS IN SHELF A TO ELEMENTS IN SHELF B HELP TO AN EFFECTIVE ALLOCATION OF PRODUCTS ON THE SHELVES THE BEST k RULES HAVING GRISSINI IN THE CONSEQUENT IN TERMS OF CONFIDENCE IN TERMS OF SUPPORT Fabio A. Schreiber data mining 18

ANY PROBLEM? c COFFEE IS IN THE BASKET c NO COFFEE IN THE BASKET t TEA IS IN THE BASKET t NO TEA IN THE BASKET t t Σcolumns c c 20 5 2 5 Σ rows 70 5 75 90 10 100 t c IS TRUE??? s = 20% c = P[t c] / P[t] =20/25= 80% PERHAPS, BUT... THOSE WHO BUY COFFEE ANYHOW REACH 90%!!! WARNING! A CORRELATION EXISTS BETWEEN TEA AND COFFEE r = P[t c] / (P[t] x P[c] ) = 0.89 Fabio A. Schreiber data mining 19

DATA MINING IN A RELATIONAL ENVIRONMENT SQL EXTENSION BY INTEGRATING DATA MINING OPERATORS INTO THE LANGUAGE INTEGRATION OF AN SQL SERVER WITH A DATA MINING ENGINE MINE RULE FOR ASSOCIATION RULES MINE CLASSIFICATION FOR CLASSIFICATION PROBLEMS MINE INTERVAL FOR DISCRETIZING CONTINUOUS ATTRIBUTES Fabio A. Schreiber data mining 20

MINE RULE: EXAMPLE MINE RULE SimpleAssociation AS SELECT DISTINCT 1..n item AS BODY, 1..1 item AS HEAD, SUPPORT, CONFIDENCE FROM Purchase GROUP BY transaction EXTRACTING RULES WITH SUPPORT: 0.1, CONFIDENCE: 0.2 TR CLIENT ITEM DATE PRICE QUANT. 1 Rossi Ski-trousers 17/12 90.000 1 Rossi Climbing-boots 17/12 180.000 1 2 3 Bianchi Bianchi Bianchi Rossi Sport shirt Brown shoes Jacket Jacket 18/12 18/12 18/12 18/12 70.000 250.000 400.000 400.000 2 1 1 1 4 Bianchi Sport shirt 19/12 70.000 3 Bianchi Jacket 19/12 400.000 2 Fabio A. Schreiber data mining 21

MINE RULE: EXAMPLE BODY HEAD SUPPORT CONFIDENCE Ski-trousers Climbing-boots Sport shirt Sport shirt Brown shoes Brown shoes Jacket Climbing-boots Ski-trousers Brown shoes Jacket Sport shirt Jacket Sport shirt 0.25 0.25 0.25 0.5 0.25 0.25 0.5 1 1 0.5 1 1 1 0.66 Jacket Brown shoes 0.25 0.33 Sport shirt, Brown shoes Jacket 0.25 1 Sport shirt, Jacket Brown shoes 0.25 0.5 Brown shoes, Jacket Sport shirt 0.25 1 Fabio A. Schreiber data mining 22

CLASSIFICATION PROBLEM EACH ELEMENT (RECORD) OF A DATA SET IS ASSOCIATED WITH A DISTINCTIVE FEATURE CALLED CLASS, ON THE BASIS OF PREVIOUS EXPERIENCES AND OBSERVATIONS PHASE 1: TRAINING A MODEL IS BUILT ON THE KNOWN SET (TRAINING SET) IN SUCH A WAY AS EACH CLASS IS A PARTITION OF THE SET PHASE 2: APPLICATION THE MODEL IS USED TO CLASSIFY NEW DATA Fabio A. Schreiber data mining 23

CLASSIFICATION PROBLEM : AN EXAMPLE PHASE 1: TRAINING MINE CLASSIFICATION CarInsuranceRules AS SELECT DISTINCT RULES ID, *, CLASS FROM CarInsurance CLASSIFY BY Risk PHASE 2: APPLICATION MINE CLASSIFICATION TEST ClassifiedApplicants AS SELECT DISTINCT *, CLASS FROM Applicants USING CLASSIFICATION FROM CarInsuranceRules AS RULES Fabio A. Schreiber data mining 24

CLASSIFICATION PROBLEM : AN EXAMPLE AGE CAR TYPE RISK 17 sports high 43 family low 68 family low 32 truck low 23 family high 18 family high 20 family high 45 sports high 50 truck low 64 truck high 46 family low 40 family low 1. IF Age 23 THEN Risk IS High; 2. IF CarType = sports THEN Risk IS High; 3. IF CarType IN {family, truck} AND Age > 23 THEN Risk IS Low; 4. DEFAULT Risk IS Low AGE CAR TYPE 22 family 60 family 35 sports AGE CAR TYPE 22 family 60 family 35 sports MINE CLASSIFICATION TEST CLASS high low high Fabio A. Schreiber data mining 25

DISCRETIZATION PROBLEM ONE MUST TRANSFORM A NUMERIC ATTRIBUTE IN A CATHEGORIAL ONE BY DIVIDING THE NUMERIC DOMAIN INTO A SET OF INTERVALS, EACH OF WHICH IS LABELLED WITH A CLASS LABEL, CONSTITUTING THE CATHEGORIAL DOMAIN DISCRETIZATION METHODS NON SUPERVISIONED ONLY THE VALUES DISTRIBUTION OF THE ATTRIBUTE IS CONSIDERED SUPERVISIONED AN EFFORT IS MADE TO KEEP AS MUCH INFORMATION AS POSSIBLE ABOUT THE CLASSES OF A TUPLE ASSOCIATED ATTRIBUTE Fabio A. Schreiber data mining 26

DISCRETIZATION PROBLEM SIMPLE DISCRETIZATION SAME WIDTH INTERVALS THE NUMERIC DOMAIN IS DIVIDED INTO A GIVEN NUMBER OF EQUAL WIDTH INTERVALS SAME FREQUENCY INTERVALS INTERVALS ARE NARROWER WHERE VALUES ARE DENSE AND WIDER WHERE VALUES ARE SPARSE MINE INTERVAL SameWidthInterval AS SELECT DISTINCT ID, LOWER, UPPER GENERATING DiscreteCarInsurance FROM CarInsurance DISCRETIZE Age BY WIDTH USING 3 INTERVALS Fabio A. Schreiber data mining 27

DISCRETIZATION PROBLEM : AN EXAMPLE ID LOWER UPPER AGE CAR TYPE RISK 17 sports high 43 family low 68 family low 32 truck low 23 family high 18 family high 20 family high 45 sports high 50 truck low 64 truck high 46 family low 40 family low 1 17 34 2 34 52 3 52 68 AGE CAR TYPE RISK 1 sports high --- ----- ---- 1 truck low 2 family low --- ----- ---- 2 truck low 3 ----- ---- 3 family low Fabio A. Schreiber data mining 28