Mike Schulte (mike@shrewd-owl.com)
Data Scientist at the University of Pittsburgh
Professor of Economics and Philosophy at Western Michigan University
Advanced Analytics Introduced; Advanced Analytics within SQL Server and Excel; R and RStudio; Connecting R to SQL Server; Solution Examples
Summary Statistics (Historical View): traditional Business Intelligence work does a lot of this already.
Fit Mathematical Models (Present-Day View): captures current capabilities and performance.
Fit Statistical Models (Forward-Looking View): captures likely future outcomes based on past and present outcomes.
SQL Server and basic SQL statements; Excel Data Mining Add-In; Analysis Services and DMX; R and R Services; Microsoft Azure Machine Learning
library(e1071)
nb_model <- naiveBayes(class ~ ., data = products)
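As a hedged follow-up sketch (the products data frame and its class column are assumed examples, not defined in this deck), the fitted model can then score rows with predict():

# Score rows with the fitted naive Bayes model and compare against the known labels
predictions <- predict(nb_model, newdata = products)
table(predicted = predictions, actual = products$class)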
Wizard interface, no programming required! Contained within Excel. Limited capabilities and older algorithms.
More flexible than the Excel add-in; integrates well with the rest of the SQL stack. Limited capabilities, older algorithms, and requires specialized knowledge of DMX.
Statistical programming environment; open source; powerful and flexible; large user community. Requires specialized knowledge of, well, R!
Academic statisticians, pharmaceutical companies, government agencies, professional consultants, business analysts, converted SAS users, and more!
Create a System DSN for the connection; connect R to your SQL database; pull data from SQL to R; analyze the data to create a model; operationalize the model. This can still be a useful way to use R with SQL Server.
Open Administrative Tools in Control Panel
Manage the ODBC Data Sources
Create a New System Data Source Name
Choose SQL Server Native Client 11.0
Choose the SQL Server Installation You Want
Recommended: Use Windows Authentication
Install RODBC Package in R
Issue standard queries; drop, create, and fetch tables; list available tables; see the documentation for more.
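A brief sketch of those table-level helpers, assuming a connection handle named channel has already been opened with odbcConnect() and using example table names:

library(RODBC)
sqlTables(channel)                                        # list available tables
autodata <- sqlFetch(channel, "autodata")                 # fetch a whole table as a data frame
sqlSave(channel, autodata, tablename = "autodata_copy")   # create a new table from a data frame
sqlDrop(channel, "autodata_copy")                         # drop a table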
Load RODBC Package and Connect to DSN
You can now issue queries from within R!
library(RODBC)

# Bring the data into R
channel  <- odbcConnect("rconnection")
autodata <- sqlQuery(channel, "SELECT id, mpg, cylinders, displacement, horsepower, weight, acceleration FROM [dbo].[autodata];")

# Split rows with complete values (for training) from rows with missing values
trainingdata <- autodata[complete.cases(autodata), ]
missingdata  <- autodata[!complete.cases(autodata), ]
# Build a linear regression model on the complete rows and impute the missing mpg values
automodel <- lm(mpg ~ horsepower + weight, data = trainingdata)
missingdata$mpg <- round(predict(automodel, newdata = missingdata), 1)
# Update our database with the imputed values
for (i in seq_along(missingdata$id)) {
  querystring <- paste0("UPDATE dbo.autodata SET mpg = ", missingdata$mpg[i],
                        " WHERE id = ", missingdata$id[i])
  sqlQuery(channel, querystring)
}
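As an alternative to looping over UPDATE statements, RODBC also offers sqlUpdate(); a hedged sketch, assuming id uniquely identifies rows and the table resolves under the default dbo schema:

# Push the imputed rows back in one call, matching on the id column
sqlUpdate(channel, missingdata, tablename = "autodata", index = "id")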
Note that this in-database approach (SQL Server R Services) is new with SQL Server 2016.
Advantages: data do not have to move; performance improvement (scale, parallelism). Challenges: harder to code; harder to set up access.
There are lots of use cases that fit into several categories: Association Analysis (Market Basket Analysis), Classification, Estimation, Simulation and Optimization, Clustering, and more.
Products often sell well together. Some of these patterns are well established and may only be confirmed by the analysis. More unexpected patterns, like the apocryphal beer and diapers example, might be discovered too, providing additional insight.
Explore associations; confirm expected patterns; find unexpected patterns; create actionable insights.
Set up periodic monitoring of known rules; detect drops in association strength and investigate.
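To make this concrete, here is a minimal sketch using the arules package (not part of the original demo); the basket file name and the support and confidence thresholds are assumptions:

library(arules)
# Read point-of-sale baskets (assumed file with one comma-separated basket per line)
trans <- read.transactions("baskets.csv", format = "basket", sep = ",")
# Mine association rules above minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
# Inspect the strongest rules by lift; monitor these over time for drops in strength
inspect(head(sort(rules, by = "lift"), 5))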
A charter fishing company wishes to determine the optimal number of boats to have in service. Too many boats will mean wasted resources, while too few boats will mean missed opportunities.
Use historical and forecast data to fit a distribution
Use the fitted distribution to project revenue for each additional boat. Decide how many boats to keep!
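A minimal R sketch of this idea, with made-up demand history, an assumed Poisson demand model, and an assumed charter price:

library(MASS)   # for fitdistr()
# Hypothetical daily charter-demand counts from past seasons
demand <- c(3, 5, 4, 6, 2, 5, 7, 4, 3, 6)
lambda <- fitdistr(demand, "Poisson")$estimate["lambda"]
# Expected trips sold per day with n boats in service: E[min(demand, n)]
expected_trips <- function(n) sum(pmin(0:50, n) * dpois(0:50, lambda))
revenue_per_trip <- 1500          # assumed price of one charter
# Expected daily revenue added by each additional boat, for 1 through 10 boats
diff(sapply(0:10, expected_trips)) * revenue_per_trip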
We would like to group countries that are economically similar to one another.
We begin with data on each country: Median GDP Growth (3 years), Population (in millions), and the Enabling Trade Index.
setwd("c:/users/michael/desktop/demos")
dfrm <- read.csv(file = "clustering-demo-data.csv", header = TRUE, stringsAsFactors = FALSE)

# Standardize each input so no single variable dominates the distance measure
dfrm$scgdpg <- scale(dfrm$medgdpg, center = TRUE, scale = TRUE)
dfrm$scpop  <- scale(dfrm$pop13,   center = TRUE, scale = TRUE)
dfrm$sceti  <- scale(dfrm$eti,     center = TRUE, scale = TRUE)

# k-means on the three scaled columns: 5 clusters, 10 random starts
kmc <- kmeans(dfrm[, 5:7], centers = 5, nstart = 10)
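A small follow-up, assuming the kmc object fitted above, to inspect the assignments behind the cluster counts listed next:

dfrm$cluster <- kmc$cluster       # attach each country's cluster label
table(dfrm$cluster)               # cluster sizes
kmc$centers                       # scaled centroids, useful for interpreting each cluster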
Cluster 1: 50 Countries
Cluster 2: 42 Countries
Cluster 3: 11 Countries
Cluster 4: 2 Countries
Cluster 5: 33 Countries
What sale price should I use for Froot Loops?
Use historical data to determine lift for each price point.
Use lift to determine relative profit for each price point. Recommend a sale price to your marketing and sales teams!
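A small sketch of the arithmetic, using made-up baseline and promotional sales figures and an assumed unit cost:

# Hypothetical history: units sold at the regular price vs. during each promotion
pricing <- data.frame(
  sale_price     = c(2.99, 2.79, 2.49, 1.99),
  baseline_units = c(100, 100, 100, 100),
  promo_units    = c(115, 135, 180, 260)
)
unit_cost <- 1.50                                    # assumed cost per box
pricing$lift   <- pricing$promo_units / pricing$baseline_units
pricing$profit <- (pricing$sale_price - unit_cost) * pricing$promo_units
pricing[which.max(pricing$profit), ]                 # candidate sale price to recommend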
Two Broad Areas of Concern: Jobs and Ethics