Data mining fundamentals

Similar documents
D B M G Data Base and Data Mining Group of Politecnico di Torino

Data Mining Course Overview

DATA MINING II - 1DL460

DATA MINING II - 1DL460

Knowledge Discovery and Data Mining

INTRODUCTION TO DATA MINING

Chapter 1, Introduction

Statistical Learning and Data Mining CS 363D/ SSC 358

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

9. Conclusions. 9.1 Definition KDD

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

COMP90049 Knowledge Technologies

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

Data Mining and Data Warehousing Introduction to Data Mining

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

COMP 465 Special Topics: Data Mining

Introduction to Data Mining

Data warehouse and Data Mining

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Data Preprocessing UE 141 Spring 2013

Foundation of Data Mining: Introduction

Data Mining Concept. References. Why Mine Data? Commercial Viewpoint. Why Mine Data? Scientific Viewpoint

Stats Overview Ji Zhu, Michigan Statistics 1. Overview. Ji Zhu 445C West Hall

Data Mining Concepts & Techniques

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Question Bank. 4) It is the source of information later delivered to data marts.

A REVIEW ON VARIOUS APPROACHES OF CLUSTERING IN DATA MINING

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Introduction to Data Mining CS 584 Data Mining (Fall 2016)

Knowledge Discovery in Data Bases

CISC 4631 Data Mining Lecture 01:

A SURVEY OF DATA MINING & ITS APPLICATIONS

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Next Steps in Data Mining. Sistemas de Apoio à Decisão Cláudia Antunes

Chapter 4 Data Mining A Short Introduction

Winter Semester 2009/10 Free University of Bozen, Bolzano

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Answer All Questions. All Questions Carry Equal Marks. Time: 20 Min. Marks: 10.

COMP 6838 Data MIning

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

Introduction to Data Mining. Lijun Zhang

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

DATA MINING LECTURE 1. Introduction

Jarek Szlichta

Data mining overview. Data Mining. Data mining overview. Data mining overview. Data mining overview. Data mining overview 3/24/2014

Introduction to Data Mining and Data Analytics

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Basic Data Mining Technique

Based on Big Data: Hype or Hallelujah? by Elena Baralis

Research on Data Mining and Statistical Analysis Xiaoyao Lu1, a

CS249: ADVANCED DATA MINING

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Chapter 7: Frequent Itemsets and Association Rules

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Data Mining: Introduction. Lecture Notes for Chapter 1. Introduction to Data Mining

Data Mining Concepts

Includes Review of Syllabus OVERVIEW OF THE CLASS

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

An Effectual Approach to Swelling the Selling Methodology in Market Basket Analysis using FP Growth

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

R07. FirstRanker. 7. a) What is text mining? Describe about basic measures for text retrieval. b) Briefly describe document cluster analysis.

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

DATA MINING TRANSACTION

Table Of Contents: xix Foreword to Second Edition

Chapter 3: Data Mining:

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

DATA WAREHOUING UNIT I

Interestingness Measurements

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Data Mining: Data. What is Data? Lecture Notes for Chapter 2. Introduction to Data Mining. Properties of Attribute Values. Types of Attributes

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Gene Clustering & Classification

Oracle9i Data Mining. Data Sheet August 2002

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Frequent Pattern Mining

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

Information mining and information retrieval : methods and applications

A Performance Assessment on Various Data mining Tool Using Support Vector Machine

Gene selection through Switched Neural Networks

DATA MINING LECTURE 1. Introduction

Data Mining Clustering

DATA MINING LECTURE 1. Introduction

CSE5243 INTRO. TO DATA MINING

Grading. 20% Activity (course attendance and homework) 40% Project (project attendance, algorithm presentation, project delivery)

Grading. Road Map. Definition ([Liu 11]) Definition ([Wikipedia]) Definition ([Ullman 09, 10])

Data Mining: An experimental approach with WEKA on UCI Dataset

KNOWLEDGE DISCOVERY AND DATA MINING

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Contents. Preface to the Second Edition

DECISION TREE ANALYSIS ON J48 AND RANDOM FOREST ALGORITHM FOR DATA MINING USING BREAST CANCER MICROARRAY DATASET

Transcription:

Data mining fundamentals Elena Baralis Politecnico di Torino Data analysis Most companies own huge bases containing operational textual documents experiment results These bases are a potential source of useful information 2 1

Data analysis Information is hidden in huge sets not immediately evident human analysts need a large amount of time for the analysis most is never analyzed at all 4,000,000 3,500,000 3,000,000 The Data Gap 2,500,000 2,000,000 1,500,000 1,000,000 500,000 0 Disk space (TB) since 1995 Analyst number 1995 1996 1997 1998 1999 From R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications 3 Data mining Non trivial extraction of implicit previously unknown potentially useful information from available Extraction is automatic performed by appropriate algorithms Extracted information is represented by means of abstract models denoted as pattern 4 2

Example: biological Microarray expression level of genes in a cellular tissue various types (mrna, DNA) Patient clinical records personal and demographic exam results Textual in public collections heterogeneous formats, different objectives scientific literature (PUBMed) ontologies (Gene Ontology) CLID PATIENT ID shx013: 49A34 shv060: 45A9 shq077: shx009: shx014: shq082: 52A28 4A34 61A31 99A6 shq083: 46A15 shx008: 41A31 IMAGE:740ISG20 int -1.02-2.34 1.44 0.57-0.13 0.12 0.34-0.51 IMAGE:767TNFSF13-0.52-4.06-0.29 0.71 1.03-0.67 0.22-0.09 IMAGE:366LOC93343-0.25-4.08 0.06 0.13 0.08 0.06-0.08-0.05 IMAGE:235ITGA4 int -1.375-1.605 0.155-0.015 0.035-0.035 0.505-0.865 5 Biological analysis objectives Clinical analysis detecting the causes of a pathology monitoring the effect of a therapy diagnosis improvement and definition of new specific therapies Bio-discovery gene network discovery analysis of multifactorial genetic pathologies Pharmacogenesis lab design of new drugs for genic therapies How can mining contribute? 6 3

Data mining contributions Pathology diagnosis classification Selecting genes involved in a specific pathology feature selection clustering Grouping genes with similar functional behavior clustering Multifactorial pathologies analysis association rules Detecting chemical components appropriate for specific therapies classification 7 Knowledge Discovery Process selection preprocessing selected preprocessed transformed transformation pattern KDD = Knowledge Discovery from Data mining interpretation knowledge 8 4

Preprocessing preprocessing cleaning reduces the effect of noise identifies or removes outliers solves inconsistencies selected preprocessed integration reconciles extracted from different sources integrates meta identifies and solves value conflicts manages redundancy Real world is dirty Without good quality, no good quality pattern 9 Data mining origins Draws from statistics, artificial intelligence (AI) pattern recognition, machine learning base systems Traditional techniques are not appropriate because of significant volume large dimensionality heterogeneous and distributed nature of Statistics, AI Data Mining Database systems Machine Learning, Pattern Recognition From: P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining 10 5

Analysis techniques Descriptive methods Extract interpretable models describing Example: client segmentation Predictive methods Exploit some known variables to predict unknown or future values of (other) variables Example: spam email detection 11 Classification Objectives prediction of a class label definition of an interpretable model of a given phenomenon training model unclassified classified 12 6

Classification training Approaches decision trees bayesian classification classification rules neural networks k-nearest neighbours SVM model unclassified classified 13 Classification training Requirements accuracy interpretability scalability noise and outlier management model unclassified classified 14 7

Classification Applications detection of customer propension to leave a company (churn or attrition) fraud detection classification of different pathology types dati di training modello dati non classificati dati classificati 15 Clustering Objectives detecting groups of similar objects identifying exceptions and outliers 16 8

Clustering Approaches partitional (K-means) hierarchical density-based (DBSCAN) SOM Requirements scalability management of noise and outliers large dimensionality interpretability 17 Clustering Applications customer segmentation clustering of documents containing similar information grouping genes with similar expression pattern 18 9

Association rules Objective extraction of frequent correlations or pattern from a transactional base Tickets at a supermarket counter Association rule TID Items diapers beer 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diapers, Milk 4 Beer, Bread, Diapers, Milk 5 Coke, Diapers, Milk 2% of transactions contains both items 30% of transactions containing diapers also contain beer 19 Association rules Applications market basket analysis cross-selling shop layout or catalogue design Tickets at a supermarket counter TID Items 1 Bread, Coca Cola, Milk 2 Beer, Bread 3 Beer, Coca Cola, Diapers, Milk 4 Beer, Bread, Diapers, Milk 5 Coca Cola, Diapers, Milk Association rule diapers beer 2% of transactions contains both items 30% of transactions containing diapers also contain beer 20 10

Other mining techniques Sequence mining ordering criteria on analyzed are taken into account example: motif detection in proteins Time series and geospatial temporal and spatial information are considered example: sensor network Regression prediction of a continuous value example: prediction of stock quotes Outlier detection example: intrusion detection in network traffic analysis Sensor network 21 Open issues Scalability to huge volumes Data dimensionality Complex structures, heterogeneous formats Data quality Privacy preservation Streaming 22 11