Data Mining with SPSS Modeler

Size: px
Start display at page:

Download "Data Mining with SPSS Modeler"

Transcription

1 Tilo Wendler Soren Grottrup Data Mining with SPSS Modeler Theory, Exercises and Solutions Springer

2 1 Introduction The Concept of the SPSS Modeler Structure and Features of This Book Prerequisites for Using This Book Structure of the Book and the Exercise/Solution Concept Using the Data and Streams Provided with the Book Datasets Provided with This Book Template Concept of This Book Introducing the Modeling Process Exercises Solutions IB Literature 22 2 Basic Functions of the SPSS Modeler Defining Streams and Scrolling Through a Dataset Switching Between Different Streams Defining or Modifying Value Labels Adding Comments to a Stream Exercises Solutions Data Handling and Sampling Methods Theory Calculations String Functions Extracting/Selecting Records Filtering Data Data Standardization: Z-Transformation Partitioning Datasets Sampling Methods Merge Datasets Ill vii

3 viii Contents Append Datasets Exercises Solutions 147 Literature Univariate Statistics Theory Discrete Versus Continuous Variables Scales of Measurement Exercises Solutions Simple Data Examination Tasks Theory Frequency Distribution of Discrete Variables Frequency Distribution of Continuous Variables Distribution Analysis with the Data Audit Node Concept of "SuperNodes" and Transforming a Variable to Normality Reclassifying Values Binning Continuous Data Exercises Solutions 259 Literature Multivariate Statistics Theory Scatterplot Scatterplot Matrix Correlation Correlation Matrix Exclusion of Spurious Correlations Contingency Tables Exercises Solutions 325 Literature Regression Models Introduction to Regression Models Motivating Examples Concept of the Modeling Process and Cross-Validation Simple Linear Regression Theory Building the Stream in SPSS Modeler Identification and Interpretation of the Model Parameters Assessment of the Goodness of Fit 362

4 ix Predicting Unknown Values Exercises Solutions Multiple Linear Regression Theory Building the Model in SPSS Modeler Final MLR Model and Its Goodness of Fit Prediction of Unknown Values Cross-Validation of the Model Boosting and Bagging (for Regression Models) Exercises Solutions Generalized Linear (Mixed) Model Theory Building a Model with the GLMM Node The Model Nugget Cross-Validation and Fitting a Quadric Regression Model Exercises Solutions The Auto Numeric Node Building a Stream with the Auto Numeric Node The Auto Numeric Model Nugget Exercises Solutions 500 Literature Factor Analysis Motivating Example General Theory of Factor Analysis Principal Component Analysis Theory Building a Model in SPSS Modeler Exercises Solutions Principal Factor Analysis Theory Building a Model Exercises Solutions 579 Literature Cluster Analysis Motivating Examples General Theory of Cluster Analysis 589

5 7.2.1 Exercises Solutions TwoStep Hierarchical Agglomerative Clustering Theoiy of Hierarchical Clustering Characteristics of the TwoStep Algorithm Building a Model in SPSS Modeler Exercises Solutions K-Means Partitioning Clustering Theory Building a Model in SPSS Modeler Exercises Solutions Auto Clustering Motivation and Implementation of the Auto Cluster Node Building a Model in SPSS Modeler Exercises Solutions Summary 710 Literature Classification Models Motivating Examples General Theory of Classification Models Process of Training and Using a Classification Model Classification Algorithms Classification vs. Clustering Making a Decision and the Decision Boundary Performance Measures of Classification Models The Analysis Node Exercises Solutions Logistic Regression Theory Building the Model in SPSS Modeler Optional: Model Types and Variable Interactions Final Model and Its Goodness of Fit Classification of Unknown Values Cross-Validation of the Model Exercises Solutions 758

6 xi 8.4 Linear Discriminate Classification Theory Building the Model with SPSS Modeler The Model Nugget and the Estimated Model Parameters Exercises Solutions Support Vector Machine Theory Building the Model with SPSS Modeler The Model Nugget Exercises Solutions Neuronal Networks Theory Building a Network with SPSS Modeler The Model Nugget Exercises Solutions k-nearest Neighbor Theory Building the Model with SPSS Modeler The Model Nugget Dimensional Reduction with PCA for Data Preprocessing Exercises Solutions Decision Trees Theory Building a Decision Tree with the C5.0 Node The Model Nugget Building a decision tree with the CHAID node Exercises Solutions The Auto Classifier Node Building a Stream with the Auto Classifier Node The Auto Classifier Model Nugget Exercises Solutions 974 Literature Using R with the Modeler Advantages of R with the Modeler Connecting with R Test the SPSS Modeler Connection to R Calculating New Variables in R 994

7 xii Contents 9.5 Model Building in R Exercises Solutions 1018 Literature Appendix Data Sets Used in This Book adult_income_data.txt beer.sav benchmark.xlsx car_simple.sav car_sales_modified.sav chess_endgame_data.txt customer_bank_data.csv diabetes_data_reduced.sav DRUGln.sav EEG_Sleep_Signals.csv employee_dataset_001 and employee_dataset_ England Payment Datasets Features_eeg_signals.csv gene_expression_leukemia.csv gene_expression_leukemia_short.csv gravity_constant_data.csv Housing.data.txt Iris.csv IT-projects.txt IT user satisfaction.sav longley.csv LPGA2009.csv Mtcars.csv nutrition_habites.sav optdigits_training.txt, optdigits_test.txt Orthodont.csv Ozone.csv pisa2012_math_q45.sav sales_list.sav ships.csv test_scores.sav Titanic.xlsx tree_credit.sav wine_data.txt WisconsinBreastCancerData.csv z_pm_customerl.sav 1057 Literature 1057

Subject. Dataset. Copy paste feature of the diagram. Importing the dataset. Copy paste feature into the diagram.

Subject. Dataset. Copy paste feature of the diagram. Importing the dataset. Copy paste feature into the diagram. Subject Copy paste feature into the diagram. When we define the data analysis process into Tanagra, it is possible to copy components (or entire branches of components) towards another location into the

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Now, Data Mining Is Within Your Reach

Now, Data Mining Is Within Your Reach Clementine Desktop Specifications Now, Data Mining Is Within Your Reach Data mining delivers significant, measurable value. By uncovering previously unknown patterns and connections in data, data mining

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

Machine Learning in Action

Machine Learning in Action Machine Learning in Action PETER HARRINGTON Ill MANNING Shelter Island brief contents PART l (~tj\ssification...,... 1 1 Machine learning basics 3 2 Classifying with k-nearest Neighbors 18 3 Splitting

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

From Building Better Models with JMP Pro. Full book available for purchase here.

From Building Better Models with JMP Pro. Full book available for purchase here. From Building Better Models with JMP Pro. Full book available for purchase here. Contents Acknowledgments... ix About This Book... xi About These Authors... xiii Part 1 Introduction... 1 Chapter 1 Introduction...

More information

9. Conclusions. 9.1 Definition KDD

9. Conclusions. 9.1 Definition KDD 9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]

More information

> Data Mining Overview with Clementine

> Data Mining Overview with Clementine > Data Mining Overview with Clementine This two-day course introduces you to the major steps of the data mining process. The course goal is for you to be able to begin planning or evaluate your firm s

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors Dejan Sarka Anomaly Detection Sponsors About me SQL Server MVP (17 years) and MCT (20 years) 25 years working with SQL Server Authoring 16 th book Authoring many courses, articles Agenda Introduction Simple

More information

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017 Lecture 27: Review Reading: All chapters in ISLR. STATS 202: Data mining and analysis December 6, 2017 1 / 16 Final exam: Announcements Tuesday, December 12, 8:30-11:30 am, in the following rooms: Last

More information

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010 INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

PARALLEL CLASSIFICATION ALGORITHMS

PARALLEL CLASSIFICATION ALGORITHMS PARALLEL CLASSIFICATION ALGORITHMS By: Faiz Quraishi Riti Sharma 9 th May, 2013 OVERVIEW Introduction Types of Classification Linear Classification Support Vector Machines Parallel SVM Approach Decision

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING

IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING SECOND EDITION IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING ith Algorithms for ENVI/IDL Morton J. Canty с*' Q\ CRC Press Taylor &. Francis Group Boca Raton London New York CRC

More information

SCHEME OF COURSE WORK. Data Warehousing and Data mining

SCHEME OF COURSE WORK. Data Warehousing and Data mining SCHEME OF COURSE WORK Course Details: Course Title Course Code Program: Specialization: Semester Prerequisites Department of Information Technology Data Warehousing and Data mining : 15CT1132 : B.TECH

More information

PATTERN CLASSIFICATION AND SCENE ANALYSIS

PATTERN CLASSIFICATION AND SCENE ANALYSIS PATTERN CLASSIFICATION AND SCENE ANALYSIS RICHARD O. DUDA PETER E. HART Stanford Research Institute, Menlo Park, California A WILEY-INTERSCIENCE PUBLICATION JOHN WILEY & SONS New York Chichester Brisbane

More information

Data Clustering in C++

Data Clustering in C++ Chapman & Hall/CRC Data Mining and Knowledge Discovery Series Data Clustering in C++ An Object-Oriented Approach Guojun Gan CRC Press Taylor & Francis Group Boca Raton London New York CRC Press Is an imprint

More information

CLASSIFICATION AND CHANGE DETECTION

CLASSIFICATION AND CHANGE DETECTION IMAGE ANALYSIS, CLASSIFICATION AND CHANGE DETECTION IN REMOTE SENSING With Algorithms for ENVI/IDL and Python THIRD EDITION Morton J. Canty CRC Press Taylor & Francis Group Boca Raton London NewYork CRC

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

CSE 190, Spring 2015: Midterm

CSE 190, Spring 2015: Midterm CSE 190, Spring 2015: Midterm Name: Student ID: Instructions Hand in your solution at or before 7:45pm. Answers should be written directly in the spaces provided. Do not open or start the test before instructed

More information

Advanced Applied Multivariate Analysis

Advanced Applied Multivariate Analysis Advanced Applied Multivariate Analysis STAT, Fall 3 Sungkyu Jung Department of Statistics University of Pittsburgh E-mail: sungkyu@pitt.edu http://www.stat.pitt.edu/sungkyu/ / 3 General Information Course

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Data Science Bootcamp Curriculum. NYC Data Science Academy

Data Science Bootcamp Curriculum. NYC Data Science Academy Data Science Bootcamp Curriculum NYC Data Science Academy 100+ hours free, self-paced online course. Access to part-time in-person courses hosted at NYC campus Machine Learning with R and Python Foundations

More information

JMP Book Descriptions

JMP Book Descriptions JMP Book Descriptions The collection of JMP documentation is available in the JMP Help > Books menu. This document describes each title to help you decide which book to explore. Each book title is linked

More information

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification CAMCOS Report Day December 9 th, 2015 San Jose State University Project Theme: Classification On Classification: An Empirical Study of Existing Algorithms based on two Kaggle Competitions Team 1 Team 2

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect BEOP.CTO.TP4 Owner: OCTO Revision: 0001 Approved by: JAT Effective: 08/30/2018 Buchanan & Edwards Proprietary: Printed copies of

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

An Introduction to Preparing Data for Analysis with JMP. Full book available for purchase here. About This Book... ix About The Author...

An Introduction to Preparing Data for Analysis with JMP. Full book available for purchase here. About This Book... ix About The Author... An Introduction to Preparing Data for Analysis with JMP. Full book available for purchase here. Contents About This Book... ix About The Author... xiii Chapter 1: Data Management in the Analytics Process...

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming

More information

Image Analysis, Classification and Change Detection in Remote Sensing

Image Analysis, Classification and Change Detection in Remote Sensing Image Analysis, Classification and Change Detection in Remote Sensing WITH ALGORITHMS FOR ENVI/IDL Morton J. Canty Taylor &. Francis Taylor & Francis Group Boca Raton London New York CRC is an imprint

More information

Python With Data Science

Python With Data Science Course Overview This course covers theoretical and technical aspects of using Python in Applied Data Science projects and Data Logistics use cases. Who Should Attend Data Scientists, Software Developers,

More information

Intro to Artificial Intelligence

Intro to Artificial Intelligence Intro to Artificial Intelligence Ahmed Sallam { Lecture 5: Machine Learning ://. } ://.. 2 Review Probabilistic inference Enumeration Approximate inference 3 Today What is machine learning? Supervised

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and

More information

Three-Dimensional Computer Vision

Three-Dimensional Computer Vision \bshiaki Shirai Three-Dimensional Computer Vision With 313 Figures ' Springer-Verlag Berlin Heidelberg New York London Paris Tokyo Table of Contents 1 Introduction 1 1.1 Three-Dimensional Computer Vision

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right

More information

Predicting Gene Function and Localization

Predicting Gene Function and Localization Predicting Gene Function and Localization By Ankit Kumar and Raissa Largman CS 229 Fall 2013 I. INTRODUCTION Our data comes from the 2001 KDD Cup Data Mining Competition. The competition had two tasks,

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions CAMCOS Report Day December 9th, 2015 San Jose State University Project Theme: Classification The Kaggle Competition

More information

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION vi TABLE OF CONTENTS ABSTRACT LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION iii xii xiii xiv 1 INTRODUCTION 1 1.1 WEB MINING 2 1.1.1 Association Rules 2 1.1.2 Association Rule Mining 3 1.1.3 Clustering

More information

Data Science. Data Analyst. Data Scientist. Data Architect

Data Science. Data Analyst. Data Scientist. Data Architect Data Science Data Analyst Data Analysis in Excel Programming in R Introduction to Python/SQL/Tableau Data Visualization in R / Tableau Exploratory Data Analysis Data Scientist Inferential Statistics &

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Machine Learning. Chao Lan

Machine Learning. Chao Lan Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian

More information

Fundamentals of Digital Image Processing

Fundamentals of Digital Image Processing \L\.6 Gw.i Fundamentals of Digital Image Processing A Practical Approach with Examples in Matlab Chris Solomon School of Physical Sciences, University of Kent, Canterbury, UK Toby Breckon School of Engineering,

More information

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

Predicting Popular Xbox games based on Search Queries of Users

Predicting Popular Xbox games based on Search Queries of Users 1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which

More information

Chemometrics. Description of Pirouette Algorithms. Technical Note. Abstract

Chemometrics. Description of Pirouette Algorithms. Technical Note. Abstract 19-1214 Chemometrics Technical Note Description of Pirouette Algorithms Abstract This discussion introduces the three analysis realms available in Pirouette and briefly describes each of the algorithms

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Learning without Class Labels (or correct outputs) Density Estimation Learn P(X) given training data for X Clustering Partition data into clusters Dimensionality Reduction Discover

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2011 Han, Kamber & Pei. All rights

More information

BIG DATA SCIENTIST Certification. Big Data Scientist

BIG DATA SCIENTIST Certification. Big Data Scientist BIG DATA SCIENTIST Certification Big Data Scientist Big Data Science Professional (BDSCP) certifications are formal accreditations that prove proficiency in specific areas of Big Data. To obtain a certification,

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

CLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi

CLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi CLUSTER ANALYSIS V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi-110 012 In multivariate situation, the primary interest of the experimenter is to examine and understand the relationship amongst the

More information

COMS 4771 Clustering. Nakul Verma

COMS 4771 Clustering. Nakul Verma COMS 4771 Clustering Nakul Verma Supervised Learning Data: Supervised learning Assumption: there is a (relatively simple) function such that for most i Learning task: given n examples from the data, find

More information

Pattern Recognition for Neuroimaging Data

Pattern Recognition for Neuroimaging Data Pattern Recognition for Neuroimaging Data Edinburgh, SPM course April 2013 C. Phillips, Cyclotron Research Centre, ULg, Belgium http://www.cyclotron.ulg.ac.be Overview Introduction Univariate & multivariate

More information

BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics

BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics Lecture 12: Ensemble Learning I Jie Wang Department of Computational Medicine & Bioinformatics University of Michigan 1 Outline Bias

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Nearest neighbor classification DSE 220

Nearest neighbor classification DSE 220 Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000

More information

Course File Leaf (Theory) For the Academic Year (Odd/Even Semester)

Course File Leaf (Theory) For the Academic Year (Odd/Even Semester) Nadar Saraswathi College of Engineering and Technology, Vadapudupatti, Theni - 625 531 (Approved by AICTE, New Delhi and Affiliated to Anna University, Chennai) Course File Leaf (Theory) For the Academic

More information

Introduction to machine learning, pattern recognition and statistical data modelling Coryn Bailer-Jones

Introduction to machine learning, pattern recognition and statistical data modelling Coryn Bailer-Jones Introduction to machine learning, pattern recognition and statistical data modelling Coryn Bailer-Jones What is machine learning? Data interpretation describing relationship between predictors and responses

More information

Business Intelligence Roadmap HDT923 Three Days

Business Intelligence Roadmap HDT923 Three Days Three Days Prerequisites Students should have experience with any relational database management system as well as experience with data warehouses and star schemas. It would be helpful if students are

More information

Pre-Requisites: CS2510. NU Core Designations: AD

Pre-Requisites: CS2510. NU Core Designations: AD DS4100: Data Collection, Integration and Analysis Teaches how to collect data from multiple sources and integrate them into consistent data sets. Explains how to use semi-automated and automated classification

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Computer Science 591Y Department of Computer Science University of Massachusetts Amherst February 3, 2005 Topics Tasks (Definition, example, and notes) Classification

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore Data Warehousing Data Mining (17MCA442) 1. GENERAL INFORMATION: PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore 560 100 Department of MCA COURSE INFORMATION SHEET Academic

More information

Machine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016

Machine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016 Machine Learning for Signal Processing Clustering Bhiksha Raj Class 11. 13 Oct 2016 1 Statistical Modelling and Latent Structure Much of statistical modelling attempts to identify latent structure in the

More information

COMPUTER AND ROBOT VISION

COMPUTER AND ROBOT VISION VOLUME COMPUTER AND ROBOT VISION Robert M. Haralick University of Washington Linda G. Shapiro University of Washington A^ ADDISON-WESLEY PUBLISHING COMPANY Reading, Massachusetts Menlo Park, California

More information

Data mining. Classification k-nn Classifier. Piotr Paszek. (Piotr Paszek) Data mining k-nn 1 / 20

Data mining. Classification k-nn Classifier. Piotr Paszek. (Piotr Paszek) Data mining k-nn 1 / 20 Data mining Piotr Paszek Classification k-nn Classifier (Piotr Paszek) Data mining k-nn 1 / 20 Plan of the lecture 1 Lazy Learner 2 k-nearest Neighbor Classifier 1 Distance (metric) 2 How to Determine

More information

Credit card Fraud Detection using Predictive Modeling: a Review

Credit card Fraud Detection using Predictive Modeling: a Review February 207 IJIRT Volume 3 Issue 9 ISSN: 2396002 Credit card Fraud Detection using Predictive Modeling: a Review Varre.Perantalu, K. BhargavKiran 2 PG Scholar, CSE, Vishnu Institute of Technology, Bhimavaram,

More information

Discriminate Analysis

Discriminate Analysis Discriminate Analysis Outline Introduction Linear Discriminant Analysis Examples 1 Introduction What is Discriminant Analysis? Statistical technique to classify objects into mutually exclusive and exhaustive

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Decision Trees: Representation

Decision Trees: Representation Decision Trees: Representation Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning

More information

Preprocessing DWML, /33

Preprocessing DWML, /33 Preprocessing DWML, 2007 1/33 Preprocessing Before you can start on the actual data mining, the data may require some preprocessing: Attributes may be redundant. Values may be missing. The data contains

More information

High-Performance Parallel Database Processing and Grid Databases

High-Performance Parallel Database Processing and Grid Databases High-Performance Parallel Database Processing and Grid Databases David Taniar Monash University, Australia Clement H.C. Leung Hong Kong Baptist University and Victoria University, Australia Wenny Rahayu

More information

Chapter 1. Using the Cluster Analysis. Background Information

Chapter 1. Using the Cluster Analysis. Background Information Chapter 1 Using the Cluster Analysis Background Information Cluster analysis is the name of a multivariate technique used to identify similar characteristics in a group of observations. In cluster analysis,

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization

More information

K236: Basis of Data Science

K236: Basis of Data Science Schedule of K236 K236: Basis of Data Science Lecture 6: Data Preprocessing Lecturer: Tu Bao Ho and Hieu Chi Dam TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai 1. Introduction to data science

More information

Unified Methods for Censored Longitudinal Data and Causality

Unified Methods for Censored Longitudinal Data and Causality Mark J. van der Laan James M. Robins Unified Methods for Censored Longitudinal Data and Causality Springer Preface v Notation 1 1 Introduction 8 1.1 Motivation, Bibliographic History, and an Overview of

More information

Introduction to Web Clustering

Introduction to Web Clustering Introduction to Web Clustering D. De Cao R. Basili Corso di Web Mining e Retrieval a.a. 2008-9 June 26, 2009 Outline Introduction to Web Clustering Some Web Clustering engines The KeySRC approach Some

More information

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010 Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Third Edition Rafael C. Gonzalez University of Tennessee Richard E. Woods MedData Interactive PEARSON Prentice Hall Pearson Education International Contents Preface xv Acknowledgments

More information

Kernel Principal Component Analysis: Applications and Implementation

Kernel Principal Component Analysis: Applications and Implementation Kernel Principal Component Analysis: Applications and Daniel Olsson Royal Institute of Technology Stockholm, Sweden Examiner: Prof. Ulf Jönsson Supervisor: Prof. Pando Georgiev Master s Thesis Presentation

More information