A Brief Introduction to Data Mining

Similar documents
A Brief Introduction to Data Mining

The R Software Environment

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Nesnelerin İnternetinde Veri Analizi

Introduction to Data Mining

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Knowledge Discovery and Data Mining

Overview. Data Mining for Business Intelligence. Shmueli, Patel & Bruce

Chapter 3: Data Mining:

Data Mining Course Overview

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

COMP 465 Special Topics: Data Mining

DATA MINING II - 1DL460

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree

2. Data Preprocessing

Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data

Data Mining An Overview ITEV, F /18

PSS718 - Data Mining

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

MetaData for Database Mining

Mining Web Data. Lijun Zhang

Data Visualization in R

Chapter 1, Introduction

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

INTRODUCTION TO DATA MINING

Data warehouse and Data Mining

3. Data Preprocessing. 3.1 Introduction

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

An imputation approach for analyzing mixed-mode surveys

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

COMP90049 Knowledge Technologies

ECLT 5810 Clustering

Mining Web Data. Lijun Zhang

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

8.NS.1 8.NS.2. 8.EE.7.a 8.EE.4 8.EE.5 8.EE.6

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Data analysis using Microsoft Excel

1. Inroduction to Data Mininig

Data Mining Classification through Simulation Technique

Now, Data Mining Is Within Your Reach

Predict Outcomes and Reveal Relationships in Categorical Data

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners

Customer Clustering using RFM analysis

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Computer Experiments: Space Filling Design and Gaussian Process Modeling

Data Quality Control: Using High Performance Binning to Prevent Information Loss

ISO : 2013 Method Statement

DATA MINING II - 1DL460

D B M G Data Base and Data Mining Group of Politecnico di Torino

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

Data mining fundamentals

Introduction to Data Mining

Machine Learning - Clustering. CS102 Fall 2017

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

ECLT 5810 Clustering

Coding Categorical Variables in Regression: Indicator or Dummy Variables. Professor George S. Easton

4. Feedforward neural networks. 4.1 Feedforward neural network structure

Learning Objectives for Data Concept and Visualization

Data Mining and Data Warehousing Introduction to Data Mining

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

Data Warehousing and Machine Learning

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Pre-Requisites: CS2510. NU Core Designations: AD

Chapter 2: Descriptive Statistics

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Study on the Application Analysis and Future Development of Data Mining Technology

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Mining data streams. Irene Finocchi. finocchi/ Intro Puzzles

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Statistical Learning and Data Mining CS 363D/ SSC 358

Defining a Data Mining Task. CSE3212 Data Mining. What to be mined? Or the Approaches. Task-relevant Data. Estimation.

Patterns that Matter

Big Data Analytics The Data Mining process. Roger Bohn March. 2016

IBM SPSS Categories. Predict outcomes and reveal relationships in categorical data. Highlights. With IBM SPSS Categories you can:

NOTES TO CONSIDER BEFORE ATTEMPTING EX 1A TYPES OF DATA

Lecture 7: Decision Trees

Clustering Large Credit Client Data Sets for Classification with SVM

Enterprise Miner Tutorial Notes 2 1

Extra readings beyond the lecture slides are important:

Data Preprocessing. Slides by: Shree Jaswal

Level 5 Diploma in Computing

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

8. Automatic Content Analysis

Visual Information Retrieval: The Next Frontier in Search

Data Management Lecture Outline 2 Part 2. Instructor: Trevor Nadeau

Data Mining and Analytics. Introduction

Yunfeng Zhang 1, Huan Wang 2, Jie Zhu 1 1 Computer Science & Engineering Department, North China Institute of Aerospace

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

A Survey Of Issues And Challenges Associated With Clustering Algorithms

CS 237: Probability in Computing

Method for security monitoring and special filtering traffic mode in info communication systems

A Survey And Comparative Analysis Of Data

Transcription:

A Brief Introduction to Data Mining L. Torgo ltorgo@dcc.fc.up.pt Departamento de Ciência de Computadores Faculdade de Ciências / Universidade do Porto Sept, 2014 Introduction Motivation for Data Mining? Data aquisition methods have evolved very rapidly Databases have grown exponentially These data contain useful information for the organisations Size makes manual inspection almost impossible Automatic data analysis methods are required to optimise the use of these huge data sets L.Torgo (DCC-FCUP) DM Intro Sept, 2014 2 / 20

What is Data Mining? Introduction A possible definition: Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel ways that are both understandable and useful to the data owner in Principles of Data Mining (Hand et.al. 2001) L.Torgo (DCC-FCUP) DM Intro Sept, 2014 3 / 20 Introduction Searching for relationships The process of searching for unknown relationships on the data usually involves several steps, like: determining the representation of the problem to use deciding how to quantify and evaluate/compare how well different representations of the relationships (models) fit the available data deciding which data management actions are required to implement the necessary algorithms efficiently L.Torgo (DCC-FCUP) DM Intro Sept, 2014 4 / 20

Introduction An example Suppose we want to understand how the entrance grade of students at the university influences the number of years they take to finish their degree. We have collected a data set where each entry contains the grade of a student (a real number) and the number of years it took her(him) to finish the degree (an integer). We could decide to fit a linear regression model to the data set - a model of the form NrYears = α + β Grade. The degree of fit of this type of models can be calculated by comparing the values predicted by the model against the collected values, and calculating some form of average prediction error As the computations required to obtain this type of models are rather simple, most probably no special data management actions would be necessary even for very large data sets. L.Torgo (DCC-FCUP) DM Intro Sept, 2014 5 / 20 Introduction Data Mining and Knowledge Discovery in Databases Data mining is sometimes taken as one of the steps of the process of knowledge discovery Data Data Pre Processing Data Mining Post Processing Feature selection Dimensionality reduction Normalization Pattern filtering Visualisation Interpretation of results Fonte: Introduction to Data Mining, Tan et.al. (2006) L.Torgo (DCC-FCUP) DM Intro Sept, 2014 6 / 20

Type of Data Sets Data Sets A data set is a collection of measurements taken from some environment. In the simplest case, we have p measurements for a set of n objects, i.e. a data matrix of dimension n p. The n rows represent the objects for which we have collected data. The p columns represent the measurements that were made for each object. The rows of the data matrix are also often named examples, instances, records or cases, while the columns are sometimes referred to as variables, features, fields or attributes. L.Torgo (DCC-FCUP) DM Intro Sept, 2014 7 / 20 Type of Data Sets An example of a data matrix Age Sex Area Income 45 m insurance 85000 32 f education 72500 24 f services 97000 Table : An example of a data table (matrix) L.Torgo (DCC-FCUP) DM Intro Sept, 2014 8 / 20

Type of Data Sets Types of Measurements Quantitative measurements Integer values Real numbers Categorical measurements Ordinal variables (implicit ordering among values - small, medium, large) Nominal variables (no order - red, blue, yellow) L.Torgo (DCC-FCUP) DM Intro Sept, 2014 9 / 20 Types of Data Sets Type of Data Sets Simple data tables (the most common situation) Databases (multiple data tables related with each other) Data streams, time series Text Multimedia data (images, sound, ) L.Torgo (DCC-FCUP) DM Intro Sept, 2014 10 / 20

Types of Models Type and structure of the models Global Local Mathematical formulae Logical formulae Black boxes Different models frequently lead to different compromises in terms of understandability and predictive accuracy L.Torgo (DCC-FCUP) DM Intro Sept, 2014 11 / 20 Types of Models Examples of different models Logical formulae - decision rules IF amount = high AND salary = low AND employment = short.term THEN risk = high IF amount = average AND salary = high THEN risk = low Mathematical formulae housevalue = 10.5 + 5.2 * nrrooms - 3.1 * distcenter + 2.6 * area L.Torgo (DCC-FCUP) DM Intro Sept, 2014 12 / 20

Data Mining Tasks Some of the Main Data Mining Tasks Exploratory Data Analysis summarisation and visualisation tools Descriptive Models probabilistic models clustering models association rules anomaly and deviation detection Predictive Models classification models regression models L.Torgo (DCC-FCUP) DM Intro Sept, 2014 13 / 20 Key Issues The Key Issues on a Data Mining Project Data Structure what to measure? pre-processing steps?... Model Structure what type of model(s) should we build?... Score Function how to evaluate the obtained models?... Optimisation and Search Method how to search and optimise the models in the context of the selected structure?... Data Management Strategy how to handle the data efficiently during model construction/evaluation?... L.Torgo (DCC-FCUP) DM Intro Sept, 2014 14 / 20

Some Illustrative Applications of Data Mining Monitoring and Forecasting Water Quality Parameters Hundreds of water quality parameters Legal limits to obey Heavy fines associated with limits Strong socio-economical impact Two main data mining tasks: Monitoring what is happening Predicting future events L.Torgo (DCC-FCUP) DM Intro Sept, 2014 15 / 20 Sept, 2014 16 / 20 Some Illustrative Applications of Data Mining Monitoring and Forecasting Forest Fires Problem with a strong socio-economical impact Identify the key drivers for fire occurrence Socio-demographic factors Landscape characteristics Meteorological factors Identify trends Forecast problems L.Torgo (DCC-FCUP) DM Intro

Some Illustrative Applications of Data Mining Data from an International Expedition to Antarctica Identify key factors for the existence of life under extreme conditions Identification of geographical regions with high relevance in terms of bio-diversity L.Torgo (DCC-FCUP) DM Intro Sept, 2014 17 / 20 Some Illustrative Applications of Data Mining Spatial Interpolation Methods Forecast values of variables at places where no data is available Sampling cost reduction Potential applications: Security Biology From: To: L.Torgo (DCC-FCUP) DM Intro Sept, 2014 18 / 20

Some Illustrative Applications of Data Mining Fraud Detection Fraud detection usually involves auditing activities These have costs and are resource-bounded Detected frauds have different outcomes (results/benefits) How to apply the limited auditing resources to the most promising cases? L.Torgo (DCC-FCUP) DM Intro Sept, 2014 19 / 20 Some Illustrative Applications of Data Mining Monitoring and Forecasting Machine Failures Industrial machines with several attached sensors measuring different parameters Different production contexts What is normal varies with the context How to anticipate machine failures to take preventive actions? L.Torgo (DCC-FCUP) DM Intro Sept, 2014 20 / 20