Machine Learning Project Report


Predictive Crime Analytics
Madlen Ivanova, Mansi Dubey, Praneesh Jayaraj
University of North Carolina at Charlotte

The project aims to analyze the crime data provided by CMPD and design a predictive model implementing classification, clustering, and neural network algorithms to predict in how many days a case can be closed. Additionally, we tried to find out whether tweets that mention CMPD have any relation to our incident data. We built predictive models based on a series of inputs: historical geographic crime patterns, day of week and time of day, weather conditions, special events, Twitter data, and unemployment.

Contents
1. Project Objective
2. Data Sources
3. Data Retrieval
4. Data Cleansing and Preparation
5. Enriching the Incident Dataset with Third-Party Data Sets
6. Joining Data
7. Data Evaluation
8. Data Exploration
9. Feature Engineering
10. Modeling
11. References

1. Project Objective

The project aims to analyze the crime data provided by CMPD and design predictive models implementing classification and neural networks to predict:
- In how many days a case can be closed
- The number of crimes that will occur
- The case status of an incident
Our goal is to build predictive models based on different inputs, evaluate them, and choose the best one.

2. Data Sources

Our main dataset was provided by CMPD. The additional input data and its sources:
- Day of week and time of day: feature engineering
- Weather conditions
- Special events: data.gov
- Unemployment: United States Department of Labor
- Twitter data: IBM Watson Analytics for Social Media

3. Data Retrieval

It took many emails and a couple of meetings with CMPD (the Charlotte-Mecklenburg Police Department) to work through the process required to gain access to the CMPD incident data. The data was made available through the CMPD web site, and a username and password were assigned to us so we could retrieve it securely. The data was presented as plain text but was not available for download; it took us a week to find a way to extract it. Since none of the software we tried worked properly, we wrote a C# program that connects to the CMPD web site, establishes a secure connection, and downloads the data. The data was then imported into MS SQL Server for further analysis.

The data arrived in plain-text format and spanned the years 2011 to 2016, with 7 tables per year, for 42 tables overall (6 years x 7 tables). We used SQL Server Management Studio to merge the years so that we ended up with just 7 tables, and we designed proper database tables so the data would fit the appropriate data types. The Complaint_No column contained the date/time value as plain text, so we had to read the column and extract the date and time into separate columns. We received 7 tables altogether and managed to link them. Here is our entity relationship diagram:

4. Data Cleansing and Preparation

We found a lot of discrepancies in the data. The source database system appears to have allowed any text to be entered into any column (no client-side data validation and no back-end data-type enforcement): it was common to see a city name in the ZipCode column, and city names were frequently misspelled. We chose the fields we would use in our analysis and did basic data cleansing. A small number of records (less than 1%) contained no usable information (no city and no zip code, so they could not be matched to a proper city), and we removed them. We also deleted all columns with more than 50% missing values, and we used multiple imputation to analyze the completeness of our dataset.
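The Complaint_No parsing step can be sketched in Python. The actual CMPD encoding is not documented in this report, so the "YYYYMMDD-HHMM-NNNNN" layout below is purely a hypothetical stand-in for illustration:

```python
from datetime import datetime

def split_complaint_no(complaint_no):
    """Split a complaint number into date, time, and sequence parts.

    Assumes a hypothetical "YYYYMMDD-HHMM-NNNNN" layout; the real
    CMPD format may differ.
    """
    date_part, time_part, seq = complaint_no.split("-")
    dt = datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
    return {
        "Date": dt.date().isoformat(),
        "Year": dt.year,
        "Month": dt.month,
        "Day": dt.day,
        "Hour": dt.hour,
        "Seq": int(seq),
    }

row = split_complaint_no("20150312-1430-00123")
```

The separate Date, Year, Month, and Day fields produced this way are what later allow joins against the weather, Twitter, and unemployment datasets.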

The data format was specified, and the type of each numeric field (ordinal or continuous) was adjusted appropriately. Outliers were replaced with the mean value of the field, with the outlier cutoff set to 3 standard deviations. All missing entries were replaced as follows:
- Continuous fields: replaced with the mean
- Nominal fields: replaced with the mode
- Ordinal fields: replaced with the median
Dates and times cannot be used directly by most algorithms, but durations can be computed and used as model features, so we estimated the duration period. Features with too many missing values (> 50%) were excluded, as were rows with too many missing values (> 50%), fields with too many unique categories (> 100), and categorical fields with too many values in a single category (> 90%). Sparse categories were merged to maximize association with the target, and input fields left with only one category after supervised merging were excluded. The dataset was partitioned into training (70%), testing (15%), and validation (15%) sets. We created several views that let us look at only the data we were interested in.

5. Enriching the Incident Dataset with Third-Party Data Sets

We merged and exported the most interesting features into a single Excel file for modeling. However, the data we had could not be used for predictive modeling on its own, as it contained few attributes useful for prediction. We therefore augmented it with multiple datasets to make it more insightful: weather, special events, Twitter, and unemployment data. The unemployment data contains the unemployment rate and labor force details for each month from 2011 to 2016 in the Charlotte area; we used the 1-month net change when downloading the data.
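The outlier and missing-value rules from section 4 can be sketched in pandas; the column names below are illustrative stand-ins, not the exact schema:

```python
import pandas as pd

def clean_missing(df, continuous, nominal, ordinal, sd_cutoff=3):
    """Apply the report's cleansing rules: replace outliers beyond
    sd_cutoff standard deviations with the mean, then fill missing
    values (continuous -> mean, nominal -> mode, ordinal -> median)."""
    out = df.copy()
    for col in continuous:
        mean, sd = out[col].mean(), out[col].std()
        out.loc[(out[col] - mean).abs() > sd_cutoff * sd, col] = mean
        out[col] = out[col].fillna(out[col].mean())
    for col in nominal:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in ordinal:
        out[col] = out[col].fillna(out[col].median())
    return out

df = pd.DataFrame({
    "MeanTemperatureF": [60.0, None, 70.0],    # continuous
    "City": ["Charlotte", None, "Charlotte"],  # nominal
    "Severity": [1, 3, None],                  # ordinal
})
cleaned = clean_missing(df, ["MeanTemperatureF"], ["City"], ["Severity"])
```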
The weather data contained the following features:
- Max TemperatureF, Mean TemperatureF, Min TemperatureF
- Max Dew PointF, MeanDew PointF, Min DewpointF
- Max Humidity, Mean Humidity, Min Humidity
- Max Sea Level PressureIn, Mean Sea Level PressureIn, Min Sea Level PressureIn
- Max VisibilityMiles, Mean VisibilityMiles, Min VisibilityMiles
- Max Wind SpeedMPH, Mean Wind SpeedMPH, Max Gust SpeedMPH
- PrecipitationIn, CloudCover, Events, WindDirDegrees

Since we worked mostly at the day level (not the hour level) and we did not have the correct time of each incident, we considered only the mean values. The special-event dataset contained the start and end date of each event, along with its location and description. The Twitter dataset contained the date, the number of tweets per day, and positive/negative sentiment analysis; it was collected manually using IBM Watson Analytics for Social Media.

6. Joining Data

We used many different tables for our model. The Charlotte-Mecklenburg Police Department provided multiple tables of incident-related information. The Incident table is the main table; it connects to the additional tables Offenses, Property, Stolen_Vehicle, Victim_Business, Victim_Person, and Weapons, all linked by the column Complaint_No. The Complaint_No column encodes the date, the time, and an incremental number that makes each record unique. Linking the remaining CMPD tables is easy, since they use the same format. To link the CMPD data to any other data, we broke the Complaint_No column down into a Date column and a DateTime column, and separately created Year, Month, and Day columns. The Unemployment, Weather, Twitter, and Special Events datasets were then linked to the CMPD data by date. The CMPD data includes details from outside the Charlotte area, but our research is limited to the city of Charlotte, so we created a ZipCodes table containing all of the zip codes for the Charlotte area; linking the data to this table lets us filter on the Charlotte zip codes. Additionally, some of the data came at the day level and some at the month level, so we had to aggregate data to the month level to link it properly by (year, month). We performed analysis at both levels.

7. Data Evaluation

We used Tableau, IBM Watson Analytics, and SPSS to explore the data, ensure its validity, and perform descriptive analysis. In SPSS, we used the Descriptives, Descriptive Statistics, and Frequencies commands to determine percentiles, quartiles, measures of dispersion, measures of central tendency (mean, median, and mode), and measures of skewness, and to create histograms. We used Tableau to better visualize the data; since we worked with a little more than half a million records, we had to live-stream the data from the MS SQL virtual cloud server, as the tool otherwise crashed constantly.
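The date-based joins described in section 6 can be sketched with pandas merges. The table and column names below are illustrative stand-ins, not the actual schema:

```python
import pandas as pd

# Hypothetical slices of the day-level tables.
incidents = pd.DataFrame({
    "Date": ["2015-03-12", "2015-03-13"],
    "NumberOfIncidents": [41, 37],
})
weather = pd.DataFrame({
    "Date": ["2015-03-12", "2015-03-13"],
    "MeanTemperatureF": [58, 61],
})
daily = incidents.merge(weather, on="Date", how="left")

# Month-level data (e.g. unemployment) joins on (year, month) instead.
daily["Month"] = pd.to_datetime(daily["Date"]).dt.strftime("%Y-%m")
unemployment = pd.DataFrame({"Month": ["2015-03"], "UnemploymentRate": [5.4]})
enriched = daily.merge(unemployment, on="Month", how="left")
```

A left join keeps every incident row even when a lookup table has no matching date, which mirrors how enrichment data was attached to the main CMPD records.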

IBM Watson Analytics was also used to create visualizations and to find dependencies between variables. This tool did not support live-streaming from another server, so we had to upload our dataset to IBM's cloud space. We performed a lot of frequency analysis to better understand the distribution of our data. Here are some of the more interesting data visualizations:
Type of incident distribution:
Day of week over vehicle theft:

Day of week over homicide:
Analysis of the number of distinct Complaint_No values for each table in the CMPD database:

Analysis of the vehicle body types stolen most frequently per zip code:
The trend of the number of tweets over week day by location type:

Number of incidents compared by year and day of the week:
Trend of the number of incidents over incident hour and case status:
Number of incidents over mean temperature by location type:

The time needed for a case to be resolved over incident hour and location type:

8. Data Exploration

We ran correlation analysis on the numeric features of the weather data. We examined each pair of features and removed one feature from every highly correlated pair, then ran the analysis again to confirm that no 1:1 correlations remained in our dataset. For instance, based on the correlations between the features below, we removed the Max Gust SpeedMPH and Mean Humidity features from our Excel file.
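Both checks used in this section and the next (pruning one feature of each highly correlated pair, and the variance inflation factor from an auxiliary regression) can be sketched on toy data rather than the actual weather features:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the second feature of each highly correlated pair, as was
    done for Max Gust SpeedMPH and Mean Humidity in the weather data."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i in range(len(cols)):
        if cols[i] in to_drop:
            continue
        for j in range(i + 1, len(cols)):
            if cols[j] not in to_drop and corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    return df.drop(columns=sorted(to_drop))

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) obtained by
    regressing that column on all the others (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    scores = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        scores.append(np.inf if r2 >= 1 else 1.0 / (1.0 - r2))
    return scores

# "b" is a perfect linear function of "a", so it is dropped.
sample = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 2, 1]})
reduced = drop_correlated(sample)
vif_scores = vif(reduced.values)
```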

Then we ran linear regression in SPSS and used the VIF (variance inflation factor) to detect multicollinearity between variables. For example, in the left picture below you can see that Mean TemperatureF and MeanDew PointF strongly influence each other; after removing MeanDew PointF, the VIF score of Mean TemperatureF becomes normal and drops below our threshold.

9. Feature Engineering

Feature engineering is fundamental to the application of machine learning. To improve our initial results, we used Microsoft SQL Server Management Studio (SSMS) to create the following features:
- Day of week and time of day features
- The time interval for a case to be closed, calculated from the reported date and the clearance date

10. Modeling

Method #1: Classification

After the extensive data analysis, we started data modeling. For classification we chose decision tree modeling. A decision tree classifier is a supervised learning algorithm that builds a model to predict class labels by learning decision rules from the data features. We created the decision tree model using the IBM SPSS Modeler tool. To measure the quality of a split, we applied the Gini function for the information gain. Since the predictors are categorical, the model uses multi-way splits, and we set a minimum required change in Gini for a split to be made. To improve the model's accuracy we used boosting with 10 component models, and we chose to favor accuracy over stability in order to create a model that can accurately predict the target variable. The model aims at minimizing the misclassification cost. The stopping rule for building the tree is based on a minimum percentage of records in the parent (5%) and child (3%) branches.

There are many continuous features in the data, such as Time-Hour, mean temperature, and humidity, which increase the learning time of the model and decrease its accuracy and performance. We therefore converted these features into interval-scaled variables, analyzed their effectiveness on the classification task, and kept the intervals that showed significant patterns in predicting the target variable. Mean temperature was converted into categories, and the incident hour, extracted from the incident date feature, was converted into a categorical feature:
00:00-03:59 - Midnight
04:00-07:59 - EarlyMorning
08:00-11:59 - Morning
12:00-15:59 - Afternoon
16:00-19:59 - Evening
20:00-23:59 - Night

We dropped the highly correlated attributes and those without a significant predictor performance rating. After multiple iterations with different combinations of input features, we implemented classification using the decision tree algorithm to predict the Case_Status of an incident. We partitioned the data into training (70%) and testing (30%) datasets. The model was trained on the training data, and its performance was analyzed on the test data for each trial using the coincidence matrix and the error rate. If the percentage of correct predictions was below 50%, we discarded that model and refined it by dropping the features with low ratings on the predictor importance chart. We considered only features that are available at the time an incident is reported, such as the weekday, time, place where the crime occurred, the agency it was reported to, the location type (indoor/outdoor), and the temperature.
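The six-interval hour binning above can be expressed directly in Python:

```python
def time_frame(hour):
    """Map an incident hour (0-23) to the report's six time-of-day bins."""
    bins = [(4, "Midnight"), (8, "EarlyMorning"), (12, "Morning"),
            (16, "Afternoon"), (20, "Evening"), (24, "Night")]
    for upper, label in bins:
        if hour < upper:
            return label
    raise ValueError("hour must be in the range 0-23")

labels = [time_frame(h) for h in (2, 6, 10, 13, 18, 23)]
```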
Following are the details of our classification model:
Target: Case_Status - the status of the report (Closed/Cleared, Closed/Leads Exhausted, Further Investigation, Inactive) at the time of the last update (September 2016).
Predictors: WeekDay, Month, TimeFrame, Place1, Reporting_Agency, Location_Type, Temp_Range, Events

Partition: training dataset (70%) and test dataset (30%)
Rules: The predictor variables for Case_Status, ranked by importance:

We tried different settings, increasing the allowed depth of the decision tree and raising the number of component models used for boosting to 20. With these updated settings, the accuracy of the model jumped to almost 70%. In the future we plan to continue pruning the model to improve its performance.

Evaluation: The predictors, as well as the size of the training data, affect the performance of the model. We iterated over the model learning process with different feature sets and different partitions. Our final result achieves 52.6% accuracy, which is not a very good model, but we can use different feature selection techniques, along with bagging and boosting, to improve its accuracy.
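As a rough illustration (not the SPSS Modeler setup itself), the train/test/evaluate loop can be sketched in Python with scikit-learn on synthetic stand-in data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for categorical predictors (weekday, time frame, ...).
X = rng.integers(0, 6, size=(500, 4))
y = (X[:, 0] + X[:, 1] > 5).astype(int)  # toy target, not the real case status

# 70/30 split, as in the report.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5, random_state=0)
tree.fit(X_tr, y_tr)

acc = tree.score(X_te, y_te)                     # error rate is 1 - acc
cm = confusion_matrix(y_te, tree.predict(X_te))  # the "coincidence matrix"
```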

Future Work: We will focus on improving the performance of the decision tree model using different feature selections, and we will try different tools and classification algorithms and compare the results to our decision tree model.

Method #2: Neural Networks

Neural networks are a preferred tool for many predictive data mining applications because of their power and flexibility. To implement neural networks, we created dummy variables to transform categorical variables into numeric ones. After pre-processing the data fed into the neural network (removing redundant information, treating multicollinearity and outliers, and all the other steps described in section 4), we ran the model. We used one of the most standard algorithms for this type of supervised learning: the multilayer perceptron (MLP). An MLP is a function of predictors (independent variables) that minimizes the prediction error of the target variable(s). Connection weights are adjusted after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is how learning occurs in the perceptron: it is carried out through "backward propagation of errors", which attempts to minimize the loss function. To link the weighted sums of the units in one layer to the values of the units in the succeeding layer, we used the sigmoid activation function σ(x) = 1/(1 + e^(-x)), which takes a real-valued input (the signal strength after the sum) and squashes it into the range between 0 and 1. We partitioned the dataset into a training sample, a test sample, and a validation (holdout) sample with ratio 70:15:15. The training sample comprises the records used to train the neural network; the test sample is an independent set of records used to track errors during training and prevent overtraining; and the holdout sample is a second independent set of records used to assess the final network.
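The pieces described above (dummy coding, the sigmoid, and the 70:15:15 partition) can be sketched as follows, with illustrative column names:

```python
import numpy as np
import pandas as pd

# One-hot (dummy) encoding turns categorical predictors into numeric inputs.
df = pd.DataFrame({"WeekDay": ["Mon", "Tue", "Mon"],
                   "TimeFrame": ["Night", "Morning", "Night"]})
encoded = pd.get_dummies(df)

def sigmoid(x):
    """Squash a weighted sum into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))

def partition(n, seed=0):
    """Shuffle row indices and split them 70:15:15 into train/test/holdout."""
    idx = np.random.default_rng(seed).permutation(n)
    a, b = round(0.70 * n), round(0.85 * n)
    return idx[:a], idx[a:b], idx[b:]

train, test, holdout = partition(1000)
```

Shuffling before splitting keeps the three samples independent, which is what makes the holdout error an "honest" estimate.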
The error on the holdout sample gives an "honest" estimate of the predictive ability of the model, because the holdout cases were not used to build it.

Models:
Model A - day level. The data used was at the day level. As the target variable, we used the time interval for a case to be resolved (Clearance_Timeframe). Initially, we set the random sample size to 20,000 (about 4% of our data). The stopping rule for building the neural network was the point at which the error could not be decreased further. We tried different combinations of inputs to compare them and find the best fit, and we also tried different sample sizes and partitioning ratios. We attempted to improve performance through feature engineering, since the MLP is sensitive to its parameters, but it was not very helpful; increasing the sample size and dropping variables were among the most useful ways to improve the model.

Results: The results were still not very good. The following list shows the features, by importance, of the model with the highest assessed accuracy:

Model B - month level. The unemployment data was given at the month level, so to match levels we had to aggregate our data up to the month level as well. As a result, we could not use the categorical variables, because aggregating them was not appropriate; all other values were either summed, averaged, or counted. After preparing the data, we fed it into our model. As target variables, we used the time interval for a case to be resolved and the number of incidents.

Results: The accuracy is better than the previous model's, and the UnemploymentRate feature appears to have a significant impact. The predictor variables for NumberOfIncidents, ranked by importance:


The results at this aggregation level were interesting. They show that the average monthly time to a case outcome is explained best by the number of incidents, the year, the average incident hour, and the number of tweets. The predictor variables for ClearanceTimeframe, ranked by importance:

Further Evaluation: The accuracy of the models depends on many factors, such as the sample size, the ratio between the sample size and the number of features used, the relationships between features, the initial weights and biases, the target variable, and the division of the data into training, validation, and test sets. Different conditions can lead to very different solutions for the same problem. For example, if we change the ratio of the training, validation, and test sets to 50:25:25, the accuracy becomes 93%; the result is similar when we include all the features we had excluded because of multicollinearity. Additionally, model accuracy can be enhanced by boosting, and model stability by bagging. In Model A, the relatively low performance can be explained by the number of missing values in the sample dataset, by noise, and by the selection of features.

When we ran the boosting option, an ensemble was created that generates a sequence of models to obtain more accurate predictions. The accuracy of Model A jumped from 63.3% to 77.5%, and the accuracy of Model B reached 97.8% for the NumberOfIncidents target variable and 99.9% for the ClearanceTimeFrame target variable. When we ran the bagging option, an ensemble was created using bagging (bootstrap aggregating), which generates multiple models to obtain more reliable predictions. The accuracy of the best generated prediction for Model A became 70%, and the accuracy of Model B reached 90.8% for NumberOfIncidents and 97.4% for ClearanceTimeFrame. The following table compares the accuracy of the algorithms used:

Model / Target                    MLP      + Bagging   + Boosting
Model A ClearanceTimeFrame        63.3%    70%         77.5%
Model B ClearanceTimeFrame        77.6%    97.4%       99.9%
Model B NumberOfIncidents         76%      90.8%       97.8%
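The effect of bagging and boosting relative to a single model can be sketched outside SPSS with scikit-learn ensembles on a toy dataset (tree-based learners here rather than the report's MLP, purely for illustration):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
# Toy nonlinear target standing in for the real incident data.
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "single tree": DecisionTreeClassifier(random_state=1),
    "bagging": BaggingClassifier(n_estimators=25, random_state=1),
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=1),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```

Bagging averages many models fit on bootstrap resamples (improving stability), while boosting fits models sequentially on the previous models' errors (improving accuracy), which matches the direction of the gains reported in the table above.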

Future work: We can consider adding new data to our existing models, such as the most recent criminal activity (within the past hours), recent calls-for-service activity, school data, and house and rent prices. We could also perform the analysis per zip code. A deep learning implementation is another option: since deep networks can be trained in an unsupervised or supervised manner for both unsupervised and supervised learning tasks, we could pre-train a deep network in an unsupervised manner before training it in a supervised one.

11. References
- Flach, P. Machine Learning.
- Why Does Unsupervised Pre-training Help Deep Learning?
- Neural network studies. 1. Comparison of overfitting and overtraining.
- Boosting.
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
- IBM SPSS Neural Networks 22.
- Deep learning basics.
- Song, Y.-Y., and Lu, Y. Decision tree methods: applications for classification and prediction.


More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office)

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office) SAS (Base & Advanced) Analytics & Predictive Modeling Tableau BI 96 HOURS Practical Learning WEEKDAY & WEEKEND BATCHES CLASSROOM & LIVE ONLINE DexLab Certified BUSINESS ANALYTICS Training Module Gurgaon

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

Data 100. Lecture 5: Data Cleaning & Exploratory Data Analysis

Data 100. Lecture 5: Data Cleaning & Exploratory Data Analysis Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis Slides by: Joseph E. Gonzalez, Deb Nolan, & Joe Hellerstein jegonzal@berkeley.edu deborah_nolan@berkeley.edu hellerstein@berkeley.edu? Last

More information

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures José Ramón Pasillas-Díaz, Sylvie Ratté Presenter: Christoforos Leventis 1 Basic concepts Outlier

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Input: Concepts, instances, attributes Terminology What s a concept?

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Six Core Data Wrangling Activities. An introductory guide to data wrangling with Trifacta

Six Core Data Wrangling Activities. An introductory guide to data wrangling with Trifacta Six Core Data Wrangling Activities An introductory guide to data wrangling with Trifacta Today s Data Driven Culture Are you inundated with data? Today, most organizations are collecting as much data in

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

CS 229 Project Report:

CS 229 Project Report: CS 229 Project Report: Machine learning to deliver blood more reliably: The Iron Man(drone) of Rwanda. Parikshit Deshpande (parikshd) [SU ID: 06122663] and Abhishek Akkur (abhakk01) [SU ID: 06325002] (CS

More information

Deep Model Compression

Deep Model Compression Deep Model Compression Xin Wang Oct.31.2016 Some of the contents are borrowed from Hinton s and Song s slides. Two papers Distilling the Knowledge in a Neural Network by Geoffrey Hinton et al What s the

More information

Lecture 20: Bagging, Random Forests, Boosting

Lecture 20: Bagging, Random Forests, Boosting Lecture 20: Bagging, Random Forests, Boosting Reading: Chapter 8 STATS 202: Data mining and analysis November 13, 2017 1 / 17 Classification and Regression trees, in a nut shell Grow the tree by recursively

More information

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set. Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean the sum of all data values divided by the number of values in

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Data Mining Concepts

Data Mining Concepts Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential

More information

Introduction to Automated Text Analysis. bit.ly/poir599

Introduction to Automated Text Analysis. bit.ly/poir599 Introduction to Automated Text Analysis Pablo Barberá School of International Relations University of Southern California pablobarbera.com Lecture materials: bit.ly/poir599 Today 1. Solutions for last

More information

Using Machine Learning for Classification of Cancer Cells

Using Machine Learning for Classification of Cancer Cells Using Machine Learning for Classification of Cancer Cells Camille Biscarrat University of California, Berkeley I Introduction Cell screening is a commonly used technique in the development of new drugs.

More information

Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree

Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree World Applied Sciences Journal 21 (8): 1207-1212, 2013 ISSN 1818-4952 IDOSI Publications, 2013 DOI: 10.5829/idosi.wasj.2013.21.8.2913 Decision Making Procedure: Applications of IBM SPSS Cluster Analysis

More information

Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis

Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis OrderNum ProdID Name OrderId Cust Name Date 1 42 Gum 1 Joe 8/21/2017 2 999 NullFood 2 Arthur 8/14/2017 2 42 Towel 2 Arthur 8/14/2017 1/31/18 Data 100 Lecture 5: Data Cleaning & Exploratory Data Analysis

More information

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect BEOP.CTO.TP4 Owner: OCTO Revision: 0001 Approved by: JAT Effective: 08/30/2018 Buchanan & Edwards Proprietary: Printed copies of

More information

Practical Guidance for Machine Learning Applications

Practical Guidance for Machine Learning Applications Practical Guidance for Machine Learning Applications Brett Wujek About the authors Material from SGF Paper SAS2360-2016 Brett Wujek Senior Data Scientist, Advanced Analytics R&D ~20 years developing engineering

More information

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC 2018 Storage Developer Conference. Dell EMC. All Rights Reserved. 1 Data Center

More information

Name Date Types of Graphs and Creating Graphs Notes

Name Date Types of Graphs and Creating Graphs Notes Name Date Types of Graphs and Creating Graphs Notes Graphs are helpful visual representations of data. Different graphs display data in different ways. Some graphs show individual data, but many do not.

More information

SAS Visual Analytics 8.1: Getting Started with Analytical Models

SAS Visual Analytics 8.1: Getting Started with Analytical Models SAS Visual Analytics 8.1: Getting Started with Analytical Models Using This Book Audience This book covers the basics of building, comparing, and exploring analytical models in SAS Visual Analytics. The

More information

1 Topic. Image classification using Knime.

1 Topic. Image classification using Knime. 1 Topic Image classification using Knime. The aim of image mining is to extract valuable knowledge from image data. In the context of supervised image classification, we want to assign automatically a

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

Data Analytics Training Program

Data Analytics Training Program Data Analytics Training Program In exclusive association with 1200+ Trainings 20,000+ Participants 10,000+ Brands 45+ Countries [Since 2009] Training partner for Who Is This Course For? Programers Willing

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Engineering the input and output Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Attribute selection z Scheme-independent, scheme-specific

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

CHAPTER 3: Data Description

CHAPTER 3: Data Description CHAPTER 3: Data Description You ve tabulated and made pretty pictures. Now what numbers do you use to summarize your data? Ch3: Data Description Santorico Page 68 You ll find a link on our website to a

More information

SAS Enterprise Miner : Tutorials and Examples

SAS Enterprise Miner : Tutorials and Examples SAS Enterprise Miner : Tutorials and Examples SAS Documentation February 13, 2018 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2017. SAS Enterprise Miner : Tutorials

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN

Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN Presentation Overview - Background - Preprocessing - Data Mining Methods to Determine Outliers - Finding Outliers - Outlier Validation -Summary

More information

Downloaded from

Downloaded from UNIT 2 WHAT IS STATISTICS? Researchers deal with a large amount of data and have to draw dependable conclusions on the basis of data collected for the purpose. Statistics help the researchers in making

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.4. Spring 2010 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Paper SAS Taming the Rule. Charlotte Crain, Chris Upton, SAS Institute Inc.

Paper SAS Taming the Rule. Charlotte Crain, Chris Upton, SAS Institute Inc. ABSTRACT Paper SAS2620-2016 Taming the Rule Charlotte Crain, Chris Upton, SAS Institute Inc. When business rules are deployed and executed--whether a rule is fired or not if the rule-fire outcomes are

More information

USING CONVEX PSEUDO-DATA TO INCREASE PREDICTION ACCURACY

USING CONVEX PSEUDO-DATA TO INCREASE PREDICTION ACCURACY 1 USING CONVEX PSEUDO-DATA TO INCREASE PREDICTION ACCURACY Leo Breiman Statistics Department University of California Berkeley, CA 94720 leo@stat.berkeley.edu ABSTRACT A prediction algorithm is consistent

More information

Data warehouses Decision support The multidimensional model OLAP queries

Data warehouses Decision support The multidimensional model OLAP queries Data warehouses Decision support The multidimensional model OLAP queries Traditional DBMSs are used by organizations for maintaining data to record day to day operations On-line Transaction Processing

More information

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data DATA ANALYSIS I Types of Attributes Sparse, Incomplete, Inaccurate Data Sources Bramer, M. (2013). Principles of data mining. Springer. [12-21] Witten, I. H., Frank, E. (2011). Data Mining: Practical machine

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

CSE4334/5334 DATA MINING

CSE4334/5334 DATA MINING CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/01/12 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Measures of Central Tendency

Measures of Central Tendency Page of 6 Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean The sum of all data values divided by the number of

More information

Table of Contents (As covered from textbook)

Table of Contents (As covered from textbook) Table of Contents (As covered from textbook) Ch 1 Data and Decisions Ch 2 Displaying and Describing Categorical Data Ch 3 Displaying and Describing Quantitative Data Ch 4 Correlation and Linear Regression

More information

M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes.

M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes. M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes. Population: Census: Biased: Sample: The entire group of objects or individuals considered

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization

More information

JMP Clinical. Release Notes. Version 5.0

JMP Clinical. Release Notes. Version 5.0 JMP Clinical Version 5.0 Release Notes Creativity involves breaking out of established patterns in order to look at things in a different way. Edward de Bono JMP, A Business Unit of SAS SAS Campus Drive

More information

Big Data Analytics The Data Mining process. Roger Bohn March. 2017

Big Data Analytics The Data Mining process. Roger Bohn March. 2017 Big Data Analytics The Data Mining process Roger Bohn March. 2017 Office hours RB Tuesday + Thursday 5:10 to 6:15. Tuesday = office rm 1315; Thursday = Peet s Sai Kolasani =? 1

More information

Introduction to Data Mining and Data Analytics

Introduction to Data Mining and Data Analytics 1/28/2016 MIST.7060 Data Analytics 1 Introduction to Data Mining and Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Performance Evaluation of Various Classification Algorithms

Performance Evaluation of Various Classification Algorithms Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***----------------------------------------------------------

More information

7. Boosting and Bagging Bagging

7. Boosting and Bagging Bagging Group Prof. Daniel Cremers 7. Boosting and Bagging Bagging Bagging So far: Boosting as an ensemble learning method, i.e.: a combination of (weak) learners A different way to combine classifiers is known

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information