D B M G Data Base and Data Mining Group of Politecnico di Torino

Similar documents
Data mining fundamentals

DATA MINING II - 1DL460

Data Mining Course Overview

DATA MINING II - 1DL460

Knowledge Discovery and Data Mining

INTRODUCTION TO DATA MINING

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Chapter 1, Introduction

Statistical Learning and Data Mining CS 363D/ SSC 358

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

9. Conclusions. 9.1 Definition KDD

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

COMP 465 Special Topics: Data Mining

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

COMP90049 Knowledge Technologies

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Data Mining and Data Warehousing Introduction to Data Mining

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Question Bank. 4) It is the source of information later delivered to data marts.

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Data warehouse and Data Mining

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

Introduction to Data Mining

Stats Overview Ji Zhu, Michigan Statistics 1. Overview. Ji Zhu 445C West Hall

Foundation of Data Mining: Introduction

A SURVEY OF DATA MINING & ITS APPLICATIONS

Data Preprocessing UE 141 Spring 2013

Data Mining Concepts & Techniques

DATA MINING LECTURE 1. Introduction

Introduction to Data Mining and Data Analytics

CISC 4631 Data Mining Lecture 01:

Data Mining Concept. References. Why Mine Data? Commercial Viewpoint. Why Mine Data? Scientific Viewpoint

Data mining overview. Data Mining. Data mining overview. Data mining overview. Data mining overview. Data mining overview 3/24/2014

Winter Semester 2009/10 Free University of Bozen, Bolzano

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

An Effectual Approach to Swelling the Selling Methodology in Market Basket Analysis using FP Growth

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Knowledge Discovery in Data Bases

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Next Steps in Data Mining. Sistemas de Apoio à Decisão Cláudia Antunes

Introduction to Data Mining CS 584 Data Mining (Fall 2016)

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Based on Big Data: Hype or Hallelujah? by Elena Baralis

A REVIEW ON VARIOUS APPROACHES OF CLUSTERING IN DATA MINING

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Data Mining Concepts & Tasks

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Introduction to Data Mining. Lijun Zhang

COMP 6838 Data MIning

Chapter 4 Data Mining A Short Introduction

Jarek Szlichta

DATA MINING LECTURE 1. Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

DATA MINING LECTURE 1. Introduction

Includes Review of Syllabus OVERVIEW OF THE CLASS

Table Of Contents: xix Foreword to Second Edition

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Overview of Web Mining Techniques and its Application towards Web

Answer All Questions. All Questions Carry Equal Marks. Time: 20 Min. Marks: 10.

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

DATA MINING INTRO LECTURE. Introduction

Data Mining Concepts

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

CS249: ADVANCED DATA MINING

DATA MINING TRANSACTION

DATA WAREHOUING UNIT I

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

3 Data, Data Mining. Chengkai Li

Data Mining: Introduction. Lecture Notes for Chapter 1. Introduction to Data Mining

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Understanding Rule Behavior through Apriori Algorithm over Social Network Data

Basic Data Mining Technique

Research on Data Mining and Statistical Analysis Xiaoyao Lu1, a

Data Mining Concepts & Tasks

Mining Web Data. Lijun Zhang

Summary. Machine Learning: Introduction. Marcin Sydow

Chapter 7: Frequent Itemsets and Association Rules

Information mining and information retrieval : methods and applications

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records.

Knowledge Discovery & Data Mining

ISM 50 - Business Information Systems

Oracle9i Data Mining. Data Sheet August 2002

Chapter 3: Data Mining:

Fall 2017 ECEN Special Topics in Data Mining and Analysis

Introduction to Machine Learning. Duen Horng (Polo) Chau Associate Director, MS Analytics Associate Professor, CSE, College of Computing Georgia Tech

Interestingness Measurements

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

CSE5243 INTRO. TO DATA MINING

R07. FirstRanker. 7. a) What is text mining? Describe about basic measures for text retrieval. b) Briefly describe document cluster analysis.

Transcription:

DataBase and Data Mining Group of Data mining fundamentals Data Base and Data Mining Group of Data analysis Most companies own huge databases containing operational data textual documents experiment results These databases are a potential source of useful information 2 1

DataBase and Data Mining Group of Data analysis Information is hidden in huge datasets not immediately evident human analysts need a large amount of time for the analysis most data is never analyzed at all 4,000,000 3,500,000 3,000,000 The Data Gap 2,500,000 2,000,000 1,500,000 1,000,000 500,000 0 Disk space (TB) since 1995 Analyst number 1995 1996 1997 1998 1999 From R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications 3 Data mining Non trivial extraction of implicit previously unknown potentially useful information from available data Extraction is automatic performed by appropriate algorithms Extracted information is represented by means of abstract models denoted as pattern 4 2

DataBase and Data Mining Group of Example: profiling Consumer behavior in e-commerce sites Selected products, requested information, Search engines and portals Query keywords, searched topics and objects Social network data Facebook, google+ profiles Dynamic data: posts on blogs, FB, tweets Maps and georeferenced data Localization, interesting locations for users 5 Example: profiling User/service profiling Recommendation systems Advertisements Market basket analysis Correlated objects for cross selling User registration, fidelity cards Context-aware data analysis Integration of different dimensions E.g., location, time of the day, user interest Text mining Brand reputation, sentiment analysis, topic trends 6 3

DataBase and Data Mining Group of Example: biological data Microarray expression level of genes in a cellular tissue various types (mrna, DNA) Patient clinical records personal and demographic data exam results Textual data in public collections heterogeneous formats, different objectives scientific literature (PUBMed) ontologies (Gene Ontology) CLID PATIENT ID shx013: shv060: shq077: shx009: shx014: shq082: 49A34 45A9 52A28 4A34 61A31 99A6 shq083: 46A15 shx008: 41A31 IMAGE:740ISG20 int -1.02-2.34 1.44 0.57-0.13 0.12 0.34-0.51 IMAGE:767TNFSF13-0.52-4.06-0.29 0.71 1.03-0.67 0.22-0.09 IMAGE:366LOC93343-0.25-4.08 0.06 0.13 0.08 0.06-0.08-0.05 IMAGE:235ITGA4 int -1.375-1.605 0.155-0.015 0.035-0.035 0.505-0.865 7 Biological analysis objectives Clinical analysis detecting the causes of a pathology monitoring the effect of a therapy diagnosis improvement and definition of new specific therapies Bio-discovery gene network discovery analysis of multifactorial genetic pathologies Pharmacogenesis lab design of new drugs for genic therapies 8 4

DataBase and Data Mining Group of Knowledge Discovery Process selection preprocessing data selected data preprocessed data transformed data transformation KDD = Knowledge Discovery from Data pattern data mining interpretation knowledge 10 Preprocessing preprocessing selected data preprocessed data data cleaning reduces the effect of noise identifies or removes outliers solves inconsistencies data integration reconciles data extracted from different sources integrates metadata identifies and solves data value conflicts manages redundancy Real world data is dirty Without good quality data, no good quality pattern 11 5

DataBase and Data Mining Group of Data mining origins Draws from statistics, artificial intelligence (AI) pattern recognition, machine learning database systems Traditional techniques are not appropriate because of significant data volume large data dimensionality heterogeneous and distributed nature of data Statistics, AI Data Mining Database systems Machine Learning, Pattern Recognition From: P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining 12 Analysis techniques Descriptive methods Extract interpretable models describing data Example: client segmentation Predictive methods Exploit some known variables to predict unknown or future values of (other) variables Example: spam email detection 13 6

DataBase and Data Mining Group of Classification Objectives prediction of a class label definition of an interpretable model of a given phenomenon training data model unclassified data classified data 14 Classification training data Approaches decision trees bayesian classification classification rules neural networks k-nearest neighbours SVM model unclassified data classified data 15 7

DataBase and Data Mining Group of Classification training data Requirements accuracy interpretability scalability noise and outlier management model unclassified data classified data 16 Classification Applications detection of customer propension to leave a company (churn or attrition) fraud detection classification of different pathology types dati di training modello dati non classificati dati classificati 17 8

DataBase and Data Mining Group of Clustering Objectives detecting groups of similar data objects identifying exceptions and outliers 18 Clustering Approaches partitional (K-means) hierarchical density-based (DBSCAN) SOM Requirements scalability management of noise and outliers large dimensionality interpretability 19 9

DataBase and Data Mining Group of Clustering Applications customer segmentation clustering of documents containing similar information grouping genes with similar expression pattern 20 Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket counter TID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diapers, Milk 4 Beer, Bread, Diapers, Milk 5 Coke, Diapers, Milk Association rule diapers beer 2% of transactions contains both items 30% of transactions containing diapers also contain beer 21 10

DataBase and Data Mining Group of Association rules Applications market basket analysis cross-selling shop layout or catalogue design Tickets at a supermarket counter TID Items 1 Bread, Coca Cola, Milk 2 Beer, Bread 3 Beer, Coca Cola, Diapers, Milk 4 Beer, Bread, Diapers, Milk 5 Coca Cola, Diapers, Milk Association rule diapers beer 2% of transactions contains both items 30% of transactions containing diapers also contain beer 22 Other data mining techniques Sequence mining ordering criteria on analyzed data are taken into account example: motif detection in proteins Time series and geospatial data temporal and spatial information are considered example: sensor network data Regression prediction of a continuous value example: prediction of stock quotes Outlier detection example: intrusion detection in network traffic analysis Sensor network 23 11

DataBase and Data Mining Group of Open issues Scalability to huge data volumes Data dimensionality Complex data structures, heterogeneous data formats Data quality Privacy preservation Streaming data 24 12