Lluis Belanche + Alfredo Vellido Data Mining II An Introduction to Mining (2)

On dates & evaluation
Lectures expected to end in the week of 14-18 Dec.
Likely essay deadline & presentation: 15th, 22nd Jan.

What's MINING?: A historicist viewpoint

MINING as a methodology

CRISP: a DM methodology
CRoss-Industry Standard Process for Data Mining: a neutral methodology from the point of view of industry, tool and application (free & non-proprietary).
Authors: Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza, Colin Shearer (SPSS); Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler).
CRISP-DM was conceived in 1996.
DaimlerChrysler: leaders in industrial application. SPSS: leaders in product development (Clementine, 1994). NCR: owners of large (huge!) databases (Teradata).
Financed by the EU. Version 1.0 was officially released in 1999.

CRISP: Hierarchic structure of the methodology

CRISP: The virtuous loop of methodology phases

CRISP: Description of phases
Problem understanding: study of targets and requirements from the business/problem viewpoint, and definition of the problem as a DM problem.
Data understanding: data collection; getting to know the data, trying to detect both quality problems and interesting features.
Data preparation: preparing the data set to be modelled, starting from raw data. This is an iterative and exploratory process: selection of files, tables, variables and record samples, plus data cleaning.
Modelling: data analysis using modelling techniques suited to the problem at hand. Includes fiddling with the models, tuning their parameters, etc.
Evaluation: all previous steps must be evaluated as a whole (as a unitary process), and we must decide whether the deliverables so far meet the DM challenge.
Implementation: all the knowledge acquired to this point must be organized and presented to the client in a usable form. We must define, together with the client, a protocol to reliably deploy the DM findings.

CRISP: The virtuous loop of methodology phases

Use of DM methodologies (2004-2007)
Enterprise Miner™: SEMMA
The acronym SEMMA (Sample, Explore, Modify, Model, Assess) refers to the core process of conducting data mining. Beginning with a statistically representative sample of your data, SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and confirm a model's accuracy.
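The five SEMMA steps can be illustrated with a toy pass over synthetic data. This is only a sketch under stated assumptions, not how Enterprise Miner works internally: the function name semma_sketch, the 70/30 split, the one-variable least-squares model and the generated data are all hypothetical choices made for illustration.

```python
import random
import statistics

def semma_sketch(records, sample_size, seed=0):
    """Toy Sample/Explore/Modify/Model/Assess pass over (x, y) pairs."""
    rng = random.Random(seed)

    # Sample: draw a random subset, then split it into a fitting part
    # and a held-out assessment part (70/30 is an arbitrary choice).
    sample = rng.sample(records, sample_size)
    cut = int(0.7 * len(sample))
    train, holdout = sample[:cut], sample[cut:]

    # Explore: basic summary statistics of the input variable.
    xs = [x for x, _ in train]
    mean_x, std_x = statistics.mean(xs), statistics.pstdev(xs)

    # Modify: standardise x using statistics from the fitting part only.
    def transform(x):
        return (x - mean_x) / std_x

    # Model: one-variable least-squares line, y = a + b * z.
    zs = [transform(x) for x, _ in train]
    ys = [y for _, y in train]
    mean_z, mean_y = statistics.mean(zs), statistics.mean(ys)
    sxy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
    sxx = sum((z - mean_z) ** 2 for z in zs)
    b = sxy / sxx
    a = mean_y - b * mean_z

    # Assess: mean squared error on the held-out part.
    errors = [(y - (a + b * transform(x))) ** 2 for x, y in holdout]
    return {"explore": (mean_x, std_x), "model": (a, b),
            "mse": statistics.mean(errors)}

# Hypothetical data: y = 2x + 1 plus a little Gaussian noise.
noise = random.Random(1)
data = [(x, 2 * x + 1 + noise.gauss(0, 0.1)) for x in range(200)]
result = semma_sketch(data, sample_size=100)
print(result["mse"])
```

Fitting the standardisation on the training part only keeps the Assess step honest, since the held-out records play no role in any earlier step.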

Use of DM methodologies (2004-2007) [chart comparing 2004 and 2007]

CRISP: Phases: Problem understanding
Phases: Problem understanding > Data understanding > Data preparation > Modelling > Evaluation > Implementation
Determine problem goal: background; problem goals; success criteria.
Assess situation: inventory of resources; requirements, assumptions and limitations; risks and contingencies; terminology; costs & benefits.
Determine DM goals: DM goals; DM success criteria.
Produce project plan: project plan; initial selection of tools.

DM application areas ('06 -> '09)

CRISP: Phases: Data understanding
Obtain initial data: initial data report.
Data description: descriptive report.
Data exploration: exploration report.
Data quality verification: quality report.

METROFANG: a real story about data understanding (1)

METROFANG: a real story about data understanding (2)
Plots (sample index 1 to 17671): inlet flow ("caudal entrada"); motor torque, Dryer A ("Par motor Secador A").
Issues noted: Missing data. Seasonality. Outliers. Time series. Weekend? FORUM???
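Two of the issues flagged on this slide, missing data and outliers, can be checked mechanically before any modelling. The sketch below is an illustration only and is not taken from the METROFANG project: the function name quality_report, the use of None for a missing reading, the made-up readings, and the choice of a median/MAD modified z-score with a 3.5 threshold are all assumptions.

```python
import statistics

def quality_report(series, threshold=3.5):
    """Report missing values and robust outliers in a list of readings.

    Missing readings are represented as None (an assumption of this sketch).
    Outliers are flagged with the modified z-score, which uses the median
    and the median absolute deviation (MAD) instead of mean and std.
    """
    missing = [i for i, v in enumerate(series) if v is None]
    present = [(i, v) for i, v in enumerate(series) if v is not None]
    values = [v for _, v in present]

    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])

    outliers = []
    if mad > 0:  # with MAD == 0 the score is undefined, so flag nothing
        outliers = [i for i, v in present
                    if 0.6745 * abs(v - median) / mad > threshold]
    return {"missing": missing, "outliers": outliers, "median": median}

# Hypothetical sensor readings with two gaps and one spike.
readings = [101.0, 99.5, None, 100.2, 100.8, 350.0, 99.9, None, 100.4, 100.1]
report = quality_report(readings)
print(report["missing"], report["outliers"])
```

A median-based score is used here because a single large spike inflates the mean and standard deviation enough to hide itself; whether that choice suits a given sensor series is a judgment call.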

Storing data ('07)
Poll: What did you use for data storage for significant data mining projects in the past year? [142 voters, 284 votes]
Text files (e.g. tab or comma delimited): 75 (52.8%)
Data mining system format (SAS, SPSS, arff): 57 (40.1%)
Excel: 28 (19.7%)
Oracle: 25 (17.6%)
SQL Server: 15 (10.6%)
MySQL: 12 (8.5%)
Other format: 10 (7.0%)
Other commercial DBMS: 7 (4.9%)
Other free DBMS: 4 (2.8%)

CRISP: Phases: Data preparation
Data selection: arguments for selection.
Data cleaning: data cleaning report.
Data reconstruction: derived variables; generated observations.
Data integration: integrated data.
Data formatting: data with new format.

Is data preparation that important?

Common data types analyzed ('07)
Compared to the 2005 KDnuggets poll on "Types of data you analyzed/mined in last 12 months", the biggest increase was in anonymized data (perhaps an indicator of the increasing importance of privacy issues).

Common data types analyzed ('09)

How big is yours? ('06 -> '09)

Data manipulation tools ('07)

CRISP: Phases: Modelling
Select modelling technique: selected technique.
Create test design: test design.
Build model: parameter selection; model; model description.
Validate model: model validation.
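The modelling tasks on this slide (test design, parameter selection, model validation) can be sketched in a few lines. This is a minimal illustration under assumptions, not a procedure prescribed by CRISP-DM: the 1-D k-nearest-neighbour classifier, the 180/60/60 split, the candidate values of k and the synthetic data are all hypothetical.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, test, k):
    """Fraction of test points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

rng = random.Random(0)

# Hypothetical data: label is 1 when x > 0.5, with 10% of labels flipped.
points = []
for _ in range(300):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    points.append((x, y))

# Test design: a fixed train / validation / test split.
train, valid, test = points[:180], points[180:240], points[240:]

# Parameter selection: choose k using the validation set only.
candidates = [1, 3, 5, 7, 9, 11]
best_k = max(candidates, key=lambda k: accuracy(train, valid, k))

# Model validation: measure accuracy once on the untouched test set.
test_accuracy = accuracy(train, test, best_k)
print(best_k, test_accuracy)
```

Keeping the test set out of parameter selection is the point of the test design step: the final figure is then an estimate for unseen data rather than for data the tuning has already seen.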

CRISP: Selection of techniques
Universe of techniques (defined by tools)
Techniques suited to a problem
Political requirements (business, executive)
Limitations (money, time, human resources; data types, knowledge)
Selected tool(s)

Commonly used models/techniques ('05)

Commonly used models/techniques ('07)

CRISP: Phases: Evaluation
Evaluate results: evaluation of DM results; approved models.
Revise processes: revision of the process.
Determine next steps: list of possible actions; decisions.

CRISP: Phases: Deployment
Plan implementation: implementation plan.
Plan monitoring & maintenance: monitoring & maintenance plan.
Generate final report: final report; final presentation.
Revise project: documentation of experience.

How do you deploy it? ('06 -> '09)
Cloud computing: computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. An example: Google Apps.

Software popularity ('07)
Free vs. commercial: debate

Software popularity ('09)

Why?
Many changes have occurred in the business application of data mining since CRISP-DM 1.0 was published. Emerging issues and requirements include:
The availability of new types of data (text, Web, and attitudinal data, for example), along with new techniques for pre-processing, analyzing, and combining them with related case data.
Integration and deployment of results with operational systems such as call centers and Web sites.
Far more demanding requirements for scalability and for deployment into real-time environments.
The need to package analytical tasks for non-analytical end users and integrate these tasks in business workflows.
The need to seamlessly integrate the deployment of results and closed-loop feedback with existing business processes.
The need to mine large-scale databases in situ, rather than exporting an analytical dataset.
Organizations' increasing reliance on teams, making it important to educate greater numbers of people on the processes and best practices associated with data mining and predictive analytics.
In July 2006 the consortium announced that it was going to start working towards a second version of CRISP-DM. On 26 September 2006, the CRISP-DM SIG met to discuss potential enhancements for CRISP-DM 2.0 and the subsequent roadmap. However, these efforts appear to have stalled: the SIG has not met, updated the CRISP website, or communicated anything to members since early 2007.