Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

Similar documents
Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

D B M G Data Base and Data Mining Group of Politecnico di Torino

Introduction to Data Mining and Data Analytics

Data mining fundamentals

9. Conclusions. 9.1 Definition KDD

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

COMP 465 Special Topics: Data Mining

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Question Bank. 4) It is the source of information later delivered to data marts.

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Knowledge Discovery in Data Bases

Introduction to Data Mining

Chapter 1, Introduction

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Knowledge Discovery and Data Mining

Unsupervised Machine Learning and Data Mining

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Time: 3 hours. Full Marks: 70. The figures in the margin indicate full marks. Answers from all the Groups as directed. Group A.

Winter Semester 2009/10 Free University of Bozen, Bolzano

Big Data Analytics The Data Mining process. Roger Bohn March. 2016

INTRODUCTION TO DATA MINING

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

1. Inroduction to Data Mininig

Next Steps in Data Mining. Sistemas de Apoio à Decisão Cláudia Antunes

DATA MINING II - 1DL460

Overview of Web Mining Techniques and its Application towards Web

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

INTRODUCTION... 2 FEATURES OF DARWIN... 4 SPECIAL FEATURES OF DARWIN LATEST FEATURES OF DARWIN STRENGTHS & LIMITATIONS OF DARWIN...

Data Mining Course Overview

A SURVEY OF DATA MINING & ITS APPLICATIONS

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

DATA MINING II - 1DL460

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Pre-Requisites: CS2510. NU Core Designations: AD

Data Mining & Feature Selection

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Cse352 Artifficial Intelligence Short Review for Midterm. Professor Anita Wasilewska Computer Science Department Stony Brook University

R07. FirstRanker. 7. a) What is text mining? Describe about basic measures for text retrieval. b) Briefly describe document cluster analysis.

Data warehouse and Data Mining

Think & Work like a Data Scientist with SQL 2016 & R DR. SUBRAMANI PARAMASIVAM (MANI)

Applications and Trends in Data Mining

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Database and Knowledge-Base Systems: Data Mining. Martin Ester

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

Parametric Comparisons of Classification Techniques in Data Mining Applications

Data Set. What is Data Mining? Data Mining (Big Data Analytics) Illustrative Applications. What is Knowledge Discovery?

SCHEME OF COURSE WORK. Data Warehousing and Data mining

Statistical Learning and Data Mining CS 363D/ SSC 358

KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW. Ana Azevedo and M.F. Santos

Data Mining Concepts

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

Big Data Analytics The Data Mining process. Roger Bohn March. 2017

The 2018 (14th) International Conference on Data Science (ICDATA)

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Introduction to Data Mining

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Dynamic Data in terms of Data Mining Streams

Record Linkage using Probabilistic Methods and Data Mining Techniques

C00-2. 資料探勘簡介 Data Mining 吳漢銘國立臺北大學統計學系.

A SURVEY ON DATA MINING TECHNIQUES FOR CLASSIFICATION OF IMAGES

Machine Learning & Data Mining

Clustering Analysis based on Data Mining Applications Xuedong Fan

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Data Mining. Allan Tucker School of Information Systems Computing and Mathematics Brunel University, London. UB8 3PH. UK

Data Management Glossary

DATA WAREHOUING UNIT I

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

Dr. SubraMANI Paramasivam. Think & Work like a Data Scientist with SQL 2016 & R

DR. JIVRAJ MEHTA INSTITUTE OF TECHNOLOGY

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Now, Data Mining Is Within Your Reach

Gene Clustering & Classification

Table Of Contents: xix Foreword to Second Edition

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Overview. Data Mining for Business Intelligence. Shmueli, Patel & Bruce

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Data Clustering With Leaders and Subleaders Algorithm

Oracle9i Data Mining. Data Sheet August 2002

KNOWLEDGE DISCOVERY AND DATA MINING

DATA MINING TRANSACTION

Keywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database.

Machine Learning (CSMML16) (Autumn term, ) Xia Hong

Summary of Last Class. Course Content. Chapter 1 Objectives

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Multivariate discretization of continuous valued attributes.

Artificial Intelligence. Programming Styles

Data Mining and. in Dynamic Networks

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Transcription:

Knowledge Discovery Javier Béjar URL - Spring 2019 CS - MIA

Knowledge Discovery (KDD)

Knowledge Discovery in Databases (KDD) Practical application of the methodologies from machine learning/statistics to large amounts of data Problem: the impossible task of manually analyzing all the data we are systematically collecting Useful for automating/helping the process of analysis/discovery Final goal: to extract (semi)automatically actionable/useful knowledge We are drowning in information and starving for knowledge URL - Spring 2019 - MAI 1/19

Knowledge Discovery in Databases The high point of KDD starts around late 1990s Many companies show their interest in obtaining the (possibly) valuable information stored in their databases (purchase transactions, e-commerce, web data,...) The area has moved/integrated/transmuted several times to include several sometimes interchangeable terms: Business Intelligence, Business Analytics, Predictive Analytics, Data Science, Big Data... The Venn Diagram Wars URL - Spring 2019 - MAI 2/19

Venn Wars: The funny one

Venn Wars: The job description one

Venn Wars: The scientific one

Venn Wars: The multidisciplinary one

KDD definitions It is the search of valuable information in great volumes of data It is the explorations and analysis, by automatic or semiautomatic tools, of great volumes of data in order to discover patterns and rules It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data URL - Spring 2019 - MAI 7/19

Elements of KDD Pattern: Any representation formalism capable to describe the common characteristics of data Valid: A pattern is valid if it is able to predict the behaviour of new information with a degree of certainty Novelty: It is novel any knowledge that it is not know respect the domain knowledge and any previous discovered knowledge URL - Spring 2019 - MAI 8/19

Elements of KDD Useful: New knowledge is useful if it allows to perform actions that yield some benefit given a established criteria Understandable: The knowledge discovered must be analyzed by an expert in the domain, in consequence the interpretability of the result is important URL - Spring 2019 - MAI 9/19

The KDD process

KDD as a process The actual discovery of patterns is only one part of a more complex process Raw data in not always ready for processing (80/20 project effort) Some general methodologies have been defined for the whole process (CRISP-DM or SEMMA) These methodologies address KDD as an engineering process, despite being business oriented are general enough to be applied on any data discovery domain URL - Spring 2019 - MAI 10/19

The KDD process Steps of the Knowledge Discovery in DB process 1. Domain study 2. Creating the dataset 3. Data preprocessing 4. Dimensionality reduction 5. Selection of the discovery goal 6. Selection of the adequate methodologies 7. Data Mining 8. Result assessment and interpretation 9. Using the knowledge URL - Spring 2019 - MAI 11/19

Goals of the KDD process There are different goals that can be pursued as the result of the discovery process, among them: Classification We need models that allow to discriminate instances that belong to a previously known set of groups (the model could or could not be interpretable) Clustering/Partitioning/Segmentation We need to discover models that clusters the data into groups with common characteristics (a characterizations of the groups is desirable) URL - Spring 2019 - MAI 12/19

Goals of the KDD process Regression We look for models that predicts the behaviour of continuous variables as a function of others Summarization We look for a compact description that summarizes the characteristics of the data Causal dependence We need models that reveal the causal dependence among the variables and assess the strength of this dependence URL - Spring 2019 - MAI 13/19

Goals of the KDD process Structure dependence We need models that reveal patterns among the relations that describe the structure of the data Change We need models that discover patterns in data that has temporal or spatial dependence URL - Spring 2019 - MAI 14/19

Methodologies for KDD Decision trees, decision rules Usually are interpretable models Can be used for: Classification, regression, and summarization trees: C4.5, CART, QUEST, rules: RIPPER, CN2,.. Classifiers, Regression Low interpretability but good accuracy Can be used for: Classification and regression Statistical regression, function approximation, Neural networks, Support Vector Machines, k-nn, Local Weighted Regression,... URL - Spring 2019 - MAI 15/19

Methodologies for KDD Clustering Partition datasets or discover groups Can be used for: Clustering, summarization Statistical Clustering, Unsupervised Machine learning, Unsupervised Neural networks (Self-Organizing Maps) Dependency models Obtain models of the dependence relations (structural, causal temporal) among attributes/instances Can be used for: causal dependence discovery, temporal change, substructure discovery Bayesian networks, association rules, Markov models, graphs algorithms,... URL - Spring 2019 - MAI 16/19

Applications

Applications Business Costumer segmentation, costumer profiling, costumer transaction data, customer churn Fraud detection Control/analysis of industrial processes E-commerce, On-line recommendation, Financial data... WEB mining Text mining, document search/organization Social networks analysis User behavior URL - Spring 2019 - MAI 17/19

Applications Scientific applications Medicine (patient data, MRI scans, ECG, EEG,...) Pharmacology (Drug discovery, screening, in-silicon testing) Astronomy (astronomical bodies identification) Genetics (gen identification, DNA microarrays, bioinformatics) Satellite data (meteorology, astronomy, geological,...) Large scientific experiments (CERN LHC, ITER) URL - Spring 2019 - MAI 18/19

Challenges

Open problems Scalability (More data, more attributes) Overfitting (Patterns with low interest) Temporal data/relational data/structured data Methods for data cleaning (Missing data and noise) Pattern comprehensibility Use of domain knowledge Integration with other techniques (OLAP, DataWarehousing, Intelligent Decision Support Systems) Privacy URL - Spring 2019 - MAI 19/19