Big Data Analytics The Data Mining process. Roger Bohn March. 2016

Similar documents
Big Data Analytics The Data Mining process. Roger Bohn March. 2017

Overview. Data Mining for Business Intelligence. Shmueli, Patel & Bruce

COMP 465 Special Topics: Data Mining

Knowledge Discovery and Data Mining

Introduction to Data Mining and Data Analytics

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Chapter 1, Introduction

Question Bank. 4) It is the source of information later delivered to data marts.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Applying Supervised Learning

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

DATA MINING II - 1DL460

Introduction to Data Mining

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Note: In the presentation I should have said "baby registry" instead of "bridal registry," see

Overview of Web Mining Techniques and its Application towards Web

Now, Data Mining Is Within Your Reach

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

DATA MINING II - 1DL460

INTRODUCTION... 2 FEATURES OF DARWIN... 4 SPECIAL FEATURES OF DARWIN LATEST FEATURES OF DARWIN STRENGTHS & LIMITATIONS OF DARWIN...

Data Mining and Analytics. Introduction

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Basic Concepts Weka Workbench and its terminology

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Machine Learning Chapter 2. Input

Notes based on: Data Mining for Business Intelligence

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data mining overview. Data Mining. Data mining overview. Data mining overview. Data mining overview. Data mining overview 3/24/2014

Data Mining Practical Machine Learning Tools and Techniques

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Data Analyst Nanodegree Syllabus

Data Preprocessing. Slides by: Shree Jaswal

SOCIAL MEDIA MINING. Data Mining Essentials

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Winter Semester 2009/10 Free University of Bozen, Bolzano

Chapter 3: Data Mining:

Data warehouse and Data Mining

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

PSS718 - Data Mining

Enterprise Miner Tutorial Notes 2 1

Data Analyst Nanodegree Syllabus

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

Data Set. What is Data Mining? Data Mining (Big Data Analytics) Illustrative Applications. What is Knowledge Discovery?

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Slice Intelligence!

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

Data Mining Concepts

Machine Learning - Clustering. CS102 Fall 2017

Data Mining: Models and Methods

Data Mining Course Overview

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016

Data Mining. SPSS Clementine k-means Algorithm. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

DATA MINING AND WAREHOUSING

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course:

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Time: 3 hours. Full Marks: 70. The figures in the margin indicate full marks. Answers from all the Groups as directed. Group A.

Clustering algorithms and autoencoders for anomaly detection

Data Science Bootcamp Curriculum. NYC Data Science Academy

YEAR 12 Trial Exam Paper FURTHER MATHEMATICS. Written examination 1. Worked solutions

Classification and Regression

Mineração de Dados Aplicada

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

Data Mining: Approach Towards The Accuracy Using Teradata!

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Input: Concepts, Instances, Attributes

劉介宇 國立台北護理健康大學 護理助產研究所 / 通識教育中心副教授 兼教師發展中心教師評鑑組長 Nov 19, 2012

Data Mining Concepts & Tasks

Dr. SubraMANI Paramasivam. Think & Work like a Data Scientist with SQL 2016 & R

Data Preprocessing. Data Preprocessing

Think & Work like a Data Scientist with SQL 2016 & R DR. SUBRAMANI PARAMASIVAM (MANI)

Data Mining Lecture 8: Decision Trees

Taking Your Application Design to the Next Level with Data Mining

Oracle9i Data Mining. An Oracle White Paper December 2001

A Platform for Provisioning Integrated Data and Visualization Capabilities Presented to SATURN in May 2016 Gerry Giese, Sandia National Laboratories

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

Data Preprocessing. Data Mining 1

Jarek Szlichta

Knowledge Discovery and Data Mining

Data Preprocessing. Supervised Learning

Summary. Machine Learning: Introduction. Marcin Sydow

Business Analytics Nanodegree Syllabus

Knowledge Discovery and Data Mining

Chapter 3: Supervised Learning

Knowledge Discovery in Data Bases

Fraud Detection Using Random Forest Algorithm

Random Forest A. Fornaser

An Automated System for Data Attribute Anomaly Detection

Slides for Data Mining by I. H. Witten and E. Frank

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Data Analytics Training Program

Transcription:

1 Big Data Analytics The Data Mining process Roger Bohn March. 2016 Office hours HK thursday5 to 6 in the library 3115 If trouble, email or Slack private message. RB Wed. 2 to 3:30 in my office Some material from Data Mining for Business Intelligence By Shmueli, Patel & Bruce

Why Data Mining? The Explosive Growth of Data: from terabytes to petabytes Data collection and data availability Automated data collection tools, database systems, Web, computerized society Major sources of abundant data Business: Web, e-commerce, transactions, stocks, Science: Remote sensing, bioinformatics, scientific simulation, Society and everyone: news, digital cameras, YouTube We are drowning in data, but starving for knowledge! Necessity is the mother of invention Data mining Automated analysis of massive data sets 6

Economist: Special report on BD in politics 7 Technology and politics The signal and the noise Mar 26th 2016 Ever easier communications and evergrowing data mountains are transforming politics in unexpected ways, says Ludwig Siegele. What will that do to democracy? Technology and politics: The signal and the noise Election campaigns: Politics by numbers Tracking protest movements: A new kind of weather Online collaboration: Connective action Local government: How cities score Living with technology: The data republic

What can we do with Data Mining? 8 Exploratory Data Analysis Predictive Modeling: Classification and Regression Descriptive Modeling Cluster analysis/segmentation Discovering Patterns and Rules Association/Dependency rules Sequential patterns Temporal sequences Deviation detection 13

Canonical examples Should we approve this transaction? (Is it fraudulent? Likely to fail?) Credit cards Mortgages Which financial reports to audit more carefully? Which buildings to inspect in NY City? What to recommend to a user? 9

Choose a good problem Gather data Data Analytics Process 10 Locate, download, examine Model Clean it e.gmissing data, out of range Model the problem Create new variables Outcome = f(x1, X2,. Xn) Select an algorithm Run analysis. Interpret results Run analysis Interpret Iterate until satisfied; DEPLOY

11

Steps in Data Mining 12 1. Define/understand problem/question/decision 2. Obtain data (may involve random sampling) 3. Explore, clean, pre-process data 4. Specify task (classification, clustering, etc.) 5. Try one or more algorithms (regression, k- Nearest Neighbors, trees, neural networks, etc.) 6. Iterative implementation and tuning 7. Assess results compare models 8. Deploy model in production mode. Daily use

Knowledge Discovery (KDD) Process This is a view from typical database systems and data warehousing Pattern Evaluation communities Data mining plays an essential role in the knowledge discovery process Data Mining Task-relevant Data Data Warehouse Selection Data Cleaning Data Integration Databases 13

Data Mining in Business Intelligence Increasing potential to support business decisions Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems DBA 14

Supervised Learning = This course 15 Goal: Predict a single target or outcome variable Training data, where target value is known Methods: Classification and Prediction

Supervised: Classification 16 Goal: Predict categorical target (outcome) variable Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy Each row is a case (customer, tax return, applicant) Each column is a variable Target variable is often binary (yes/no) Deliberately biased classifications: cost of errors

Supervised: Prediction 17 Goal: Predict numerical target (outcome) variable Examples: sales, revenue, performance As in classification: Each row is a case (customer, tax return, applicant) Each column is a variable Regression a common tool, but often not interested in value of the coefficients per se. Instead: forecast outcome for a new case Taken together, classification and prediction constitute predictive analytics

(Unsupervised) Data Visualization 18 Graphs and plots of data Histograms, boxplots, bar charts, scatterplots Especially useful to examine relationships between pairs of variables General concept: Exploratory Data Analysis Where do you start with new data?

Obtaining Data: Sampling 19 Data mining typically deals with huge databases Algorithms and models are typically applied to a sample from a database, to produce statisticallyvalid results Once you develop and select a final model, you use it to score the observations in the larger database

Pre-processing Data 1. Format conversion e.g. text to numeric 2. Parsing e.g. web data 3. Merging data from multiple sources 4. Dealing with outliers 5. Missing observations (some algorithms don t care) 6. Rare event oversampling 7. Normalizing Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets. For Big-Data Scientists, Janitor Work Is Key Hurdle to Insights By STEVE LOHR NY Times AUG. 17, 2014

Convert Variable Types 21 Determine the types of pre-processing needed, and algorithms used Main distinction: Categorical vs. numeric Categorical variables Binary (male/female, student/non-student) Ordered (low, medium, high) Unordered (Ford, Toyota, Honda)

Detecting Outliers 22 An outlier is an observation that is extreme, being distant from the rest of the data (definition of distant is deliberately vague) Outliers can have disproportionate influence on models (a problem if it is spurious) An important step in data pre-processing is detecting outliers Once detected, domain knowledge is required to determine if it is an error, or truly extreme. Common example: misplaced decimal point

Handling Missing Data 23 Many algorithms will not process records with missing values. Default is to drop those records. Solution 1: Omission If a small number of records have missing values, can omit them If many records are missing values on a small set of variables, can drop those variables (or use proxies) If many records have missing values, omission is not practical Solution 2: Imputation Replace missing values with reasonable substitutes Lets you keep the record and use the rest of its (non-missing) information Solution 3: Use an algorithm that handles missing data (Classification Trees)

Rare event oversampling 24 Often the event of interest is rare Examples: response to mailing, fraud, Only a few percent of total sample. Sampling may yield too few interesting cases to effectively train a model A popular solution: oversample the rare cases to obtain a more balanced training set Later, need to adjust results for the oversampling

Normalizing (Standardizing) Data Needed when variables with the largest scales would dominate and skew results 25 Needed for some algorithms (eg knn); not for others (regression) Puts all variables on same scale Is weight in g or kg? Meters or feet or km? Normalizing function: Subtract mean and divide by standard deviation Alternative: scale to 0-1 by subtracting minimum and dividing by the range Useful when the data contain dummies and numeric Sometimes best not to normalize. More insight from coefficient values.

Partitioning the Data 27 Problem: How well will our model perform with new data? Solution: Separate data into two parts Training partition to develop the model Validation partition to implement the model and evaluate its performance on new data Addresses the issue of overfitting

Multiple Partitions When a model is developed on training data, it can overfit the training data (hence need to assess on validation) Assessing multiple models on same validation data can overfit validation data Some methods use the validation data to choose a parameter. This too can lead to overfitting the validation data Solution: final selected model is applied to a third test partition. Unbiased estimate of its performance on new data 28

Concept should be used in classic regression analysis Instead of trying to estimate the forecast errors, just measure them! Standard error of residuals t test = no longer needed! Statistical estimates of errors have long list of assumptions: Homoskedastic errors No autocorrelation 29 No important omitted variables (hah) Hold aside a testing sample

100% fit not useful for new data 1600 1400 1200 1000 Revenue 800 600 400 200 0 0 100 200 300 400 500 600 700 800 900 1000 Expenditure 35

Summary 36 Data Mining includes many supervised methods (Classification & Prediction) + some unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization) Before algorithms can be applied, data must be characterized and pre-processed. This takes work! And thought! And creativity! To evaluate performance and to avoid overfitting, partition the data Data mining methods are applied to a part of a large dataset, and then the best model is used to analyze the rest of dataset

For Monday: Analyze car data Selling prices: high or low (binary classification) Data is already imported (might have errors) Which variables to include? Reclassify any variables. Number of doors? 2, 3, 4, 5 Ordered categories, or not? Use Classification tree to predict ( fit ) high or low Measure your accuracy using different observations Try different variables to improve accuracy: how much can you improve. 38

Partitioning the data 39