Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Similar documents
Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

9. Conclusions. 9.1 Definition KDD

Introduction to Data Mining and Data Analytics

D B M G Data Base and Data Mining Group of Politecnico di Torino

COMP 465 Special Topics: Data Mining

Unsupervised Machine Learning and Data Mining

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Data mining fundamentals

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Knowledge Discovery in Data Bases

INTRODUCTION... 2 FEATURES OF DARWIN... 4 SPECIAL FEATURES OF DARWIN LATEST FEATURES OF DARWIN STRENGTHS & LIMITATIONS OF DARWIN...

Knowledge Discovery and Data Mining

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Question Bank. 4) It is the source of information later delivered to data marts.

Introduction to Data Mining

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Chapter 1, Introduction

Python With Data Science

INTRODUCTION TO DATA MINING

Big Data Analytics The Data Mining process. Roger Bohn March. 2016

Data Set. What is Data Mining? Data Mining (Big Data Analytics) Illustrative Applications. What is Knowledge Discovery?

Winter Semester 2009/10 Free University of Bozen, Bolzano

Data Mining Course Overview

Applications and Trends in Data Mining

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW. Ana Azevedo and M.F. Santos

Overview of Web Mining Techniques and its Application towards Web

DATA MINING II - 1DL460

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

1. Inroduction to Data Mininig

Think & Work like a Data Scientist with SQL 2016 & R DR. SUBRAMANI PARAMASIVAM (MANI)

Time: 3 hours. Full Marks: 70. The figures in the margin indicate full marks. Answers from all the Groups as directed. Group A.

DATA MINING II - 1DL460

Data warehouse and Data Mining

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Next Steps in Data Mining. Sistemas de Apoio à Decisão Cláudia Antunes

Pre-Requisites: CS2510. NU Core Designations: AD

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Cse352 Artifficial Intelligence Short Review for Midterm. Professor Anita Wasilewska Computer Science Department Stony Brook University

A SURVEY OF DATA MINING & ITS APPLICATIONS

Now, Data Mining Is Within Your Reach

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

Evaluation of Machine Learning Algorithms for Satellite Operations Support

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

COMP 6838 Data MIning

Oracle9i Data Mining. Data Sheet August 2002

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Data Mining & Feature Selection

Database and Knowledge-Base Systems: Data Mining. Martin Ester

3 Data, Data Mining. Chengkai Li

R07. FirstRanker. 7. a) What is text mining? Describe about basic measures for text retrieval. b) Briefly describe document cluster analysis.

Parametric Comparisons of Classification Techniques in Data Mining Applications

Statistical Learning and Data Mining CS 363D/ SSC 358

Data Mining Concepts

Dynamic Data in terms of Data Mining Streams

ML 프로그래밍 ( 보충 ) Scikit-Learn

A Survey of Statistical Modeling Tools

C00-2. 資料探勘簡介 Data Mining 吳漢銘國立臺北大學統計學系.

Data Analytics Training Program using

data-based banking customer analytics

Dr. SubraMANI Paramasivam. Think & Work like a Data Scientist with SQL 2016 & R

Machine Learning & Data Mining

Big Data Analytics The Data Mining process. Roger Bohn March. 2017

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

Data Analytics Training Program

Data Mining. Allan Tucker School of Information Systems Computing and Mathematics Brunel University, London. UB8 3PH. UK

Introduction to Data Mining

Data Management Glossary

SCHEME OF COURSE WORK. Data Warehousing and Data mining

Data Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140

Data Mining. Vera Goebel. Department of Informatics, University of Oslo

International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Record Linkage using Probabilistic Methods and Data Mining Techniques

Data Mining An Overview ITEV, F /18

Introduction to MapReduce (cont.)

Data Warehousing and Machine Learning

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

A SURVEY ON DATA MINING TECHNIQUES FOR CLASSIFICATION OF IMAGES

The 2018 (14th) International Conference on Data Science (ICDATA)

The Data Mining usage in Production System Management

ENTERPRISE MINER: 1 DATA EXPLORATION AND VISUALISATION

Dta Mining and Data Warehousing

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining. Data Analysis and Knowledge Discovery

Transcription:

Knowledge Discovery Javier Béjar cbea URL - Spring 2018 CS - MIA 1/22

Knowledge Discovery (KDD)

Knowledge Discovery in Databases (KDD) Practical application of the methodologies from machine learning/statistics to large amounts of data Problem: the impossible task of manually analyzing (make sense of) all the data we are systematically collecting Useful for automating/helping the process of analysis/discovery Final goal: to extract (semi)automatically actionable/useful knowledge We are drowning in information and starving for knowledge URL - Spring 2018 - MAI 2/22

Knowledge Discovery in Databases The high point of KDD starts around late 1990s Many companies show their interest in obtaining the (possibly) valuable information stored in their databases (purchase transactions, e-commerce, web data,...) The area has moved/integrated/transmuted several times to include several sometimes interchangeable terms: Business Intelligence, Business Analytics, Predictive Analytics, Data Science, Big Data... The Venn Diagram Wars URL - Spring 2018 - MAI 3/22

Venn Wars: The funny one URL - Spring 2018 - MAI 4/22

Venn Wars: The job description one URL - Spring 2018 - MAI 5/22

Venn Wars: The scientific one URL - Spring 2018 - MAI 6/22

Venn Wars: The multidisciplinary one URL - Spring 2018 - MAI 7/22

KDD definitions It is the search of valuable information in great volumes of data It is the explorations and analysis, by automatic or semiautomatic tools, of great volumes of data in order to discover patterns and rules It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data URL - Spring 2018 - MAI 8/22

Elements of KDD Pattern: Any representation formalism capable to describe the common characteristics of data Valid: A pattern is valid if it is able to predict the behaviour of new information with a degree of certainty Novelty: It is novel any knowledge that it is not know respect the domain knowledge and any previous discovered knowledge URL - Spring 2018 - MAI 9/22

Elements of KDD Useful: New knowledge is useful if it allows to perform actions that yield some benefit given a established criteria Understandable: The knowledge discovered must be analyzed by an expert in the domain, in consequence the interpretability of the result is important URL - Spring 2018 - MAI 10/22

The KDD process

KDD as a process The actual discovery of patterns is only one part of a more complex process Raw data in not always ready for processing (80/20 project effort) Some general methodologies have been defined for the whole process (CRISP-DM or SEMMA) These methodologies address KDD as an engineering process, despite being business oriented are general enough to be applied on any data discovery domain URL - Spring 2018 - MAI 11/22

The KDD process Steps of the Knowledge Discovery in DB process 1. Domain study 2. Creating the dataset 3. Data preprocessing 4. Dimensionality reduction 5. Selection of the discovery goal 6. Selection of the adequate methodologies 7. Data Mining 8. Result assessment and interpretation 9. Using the knowledge URL - Spring 2018 - MAI 12/22

Goals of the KDD process There are different goals that can be pursued as the result of the discovery process, among them: Classification: We need models that allow to discriminate instances that belong to a previously known set of groups (the model could or could not be interpretable) Clustering/Partitioning/Segmentation: We need to discover models that clusters the data into groups with common characteristics (a characterizations of the groups is desirable) URL - Spring 2018 - MAI 13/22

Goals of the KDD process Regression: We look for models that predicts the behaviour of continuous variables as a function of others Summarization: We look for a compact description that summarizes the characteristics of the data Causal dependence: We need models that reveal the causal dependence among the variables and assess the strength of this dependence URL - Spring 2018 - MAI 14/22

Goals of the KDD process Structure dependence: We need models that reveal patterns among the relations that describe the structure of the data Change: We need models that discover patterns in data that has temporal or spatial dependence URL - Spring 2018 - MAI 15/22

Methodologies for KDD There are a lot of methodologies that can be applied in the discovery process, the more usual are: Decision trees, decision rules: Usually are interpretable models Can be used for: Classification, regression, and summarization trees: C4.5, CART, QUEST, rules: RIPPER, CN2,.. Classifiers, Regression: Low interpretability but good accuracy Can be used for: Classification and regression Statistical regression, function approximation, Neural networks, Support Vector Machines, k-nn, Local Weighted Regression,... URL - Spring 2018 - MAI 16/22

Methodologies for KDD Clustering: Its goal is to partition datasets or discover groups Can be used for: Clustering, summarization Statistical Clustering, Unsupervised Machine learning, Unsupervised Neural networks (Self-Organizing Maps) Dependency models (attribute dependence, temporal dependence, graph substructures) Its goal is to obtain models (some interpretables) of the dependence relations (structural, causal temporal) among attributes/instances Can be used for: causal dependence discovery, temporal change, substructure discovery Bayesian networks, association rules, Markov models, graph algorithms,... URL - Spring 2018 - MAI 17/22

Applications

Applications Business: Costumer segmentation, costumer profiling, costumer transaction data, customer churn Fraud detection Control/analysis of industrial processes e-commerce, on-line recommendation Financial data (stock market analysis) WEB mining Text mining, document search/organization Social networks analysis User behavior URL - Spring 2018 - MAI 18/22

Applications Scientific applications: Medicine (patient data, MRI scans, ECG, EEG,...) Pharmacology (Drug discovery, screening, in-silicon testing) Astronomy (astronomical bodies identification) Genetics (gen identification, DNA microarrays, bioinformatics) Satellite/Probe data (meteorology, astronomy, geological,...) Large scientific experiments (CERN LHC, ITER) URL - Spring 2018 - MAI 19/22

Tools

Tools for KDD Tools developed at universities (C5.0, CART/MARS) have become commercial products, others still remain open source (Weka, R, scikit-learn) Data analysis software companies incorporate KDD techniques inside classical data analysis tools (SPSS, SAS) Companies selling databases add KDD tools as an added value (IBM DB2 (intelligent Miner), SQL Server, Oracle) Machine Learning as a Service (Amazon, Microsoft, Google, IBM Watson, Big ML,...) URL - Spring 2018 - MAI 20/22

Tools for the course Python General programming language, easy to learn numpy, scipy, pandas scikit-learn (http://scikit-learn.org) Data preprocessing, Clustering Algorithms, Association Rules,... R (http://cran.r-project.org/) Statistic analysis oriented language, more steep learning curve Many packages Data preprocessing, Clustering Algorithms, Association Rules,... URL - Spring 2018 - MAI 21/22

Challenges

Open problems Scalability (More data, more attributes) Overfitting (Patterns with low interest) Statistical significance of the results Methods for temporal data/relational data/structured data Methods for data cleaning (Missing data and noise) Pattern comprehensibility Use of domain knowledge Integration with other techniques (OLAP, DataWarehousing, Business Intelligence, Intelligent Decision Support Systems) Privacy URL - Spring 2018 - MAI 22/22