Mike Schulte (mike@shrewd-owl.com)
Data Scientist at the University of Pittsburgh
Professor of Economics and Philosophy at Western Michigan University
Advanced Analytics Introduced; Advanced Analytics within SQL Server and Excel; R and RStudio; Connecting R to SQL Server; Solution Examples
Summary Statistics (Historical View): traditional Business Intelligence work does a lot of this already.
Fit Mathematical Models (Present-Day View): captures current capabilities and performance.
Fit Statistical Models (Forward-Looking View): captures likely future outcomes based on past and present outcomes.
SQL Server and basic SQL statements; Excel Data Mining Add-In; Analysis Services and DMX; R and R Services; Microsoft Azure Machine Learning
library(e1071)
nb_model <- naiveBayes(class ~ ., data = products)
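As a hedged follow-up sketch (the products data frame and its class column are assumed examples, not defined in this deck), the fitted model can then score rows with predict():

# Score rows with the fitted naive Bayes model and compare against the known labels
predictions <- predict(nb_model, newdata = products)
table(predicted = predictions, actual = products$class)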
Wizard interface, no programming required! Contained within Excel. Limited capabilities and older algorithms.
More flexible than the Excel add-in; integrates well with the rest of the SQL stack. Limited capabilities, older algorithms, and requires specialized knowledge of DMX.
Statistical programming environment; open source; powerful and flexible; large user community. Requires specialized knowledge of, well, R!
Academic statisticians, pharmaceutical companies, government agencies, professional consultants, business analysts, converted SAS users, and more!
Create a System DSN for the connection; connect R to your SQL database; pull data from SQL to R; analyze the data to create a model; operationalize the model. This can still be a useful way to use R with SQL Server.
Open Administrative Tools in Control Panel
Manage the ODBC Data Sources
Create a New System Data Source Name
Choose SQL Server Native Client 11.0
Choose the SQL Server Installation You Want
Recommended: Use Windows Authentication
Install RODBC Package in R
Issue standard queries; drop, create, and fetch tables; list available tables; see the documentation for more.
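A brief sketch of those table-level helpers, assuming a connection handle named channel has already been opened with odbcConnect() and using example table names:

library(RODBC)
sqlTables(channel)                                        # list available tables
autodata <- sqlFetch(channel, "autodata")                 # fetch a whole table as a data frame
sqlSave(channel, autodata, tablename = "autodata_copy")   # create a new table from a data frame
sqlDrop(channel, "autodata_copy")                         # drop a table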
Load RODBC Package and Connect to DSN
You can now issue queries from within R!
library(RODBC)

# Bring the data into R
channel  <- odbcConnect("rconnection")
autodata <- sqlQuery(channel, "SELECT id, mpg, cylinders, displacement, horsepower, weight, acceleration FROM [dbo].[autodata];")

# Split rows with complete values (for training) from rows with missing values
trainingdata <- autodata[complete.cases(autodata), ]
missingdata  <- autodata[!complete.cases(autodata), ]
# Build a linear regression model on the complete rows and impute the missing mpg values
automodel <- lm(mpg ~ horsepower + weight, data = trainingdata)
missingdata$mpg <- round(predict(automodel, newdata = missingdata), 1)
# Update our database with the imputed values
for (i in seq_along(missingdata$id)) {
  querystring <- paste0("UPDATE dbo.autodata SET mpg = ", missingdata$mpg[i],
                        " WHERE id = ", missingdata$id[i])
  sqlQuery(channel, querystring)
}
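As an alternative to looping over UPDATE statements, RODBC also offers sqlUpdate(); a hedged sketch, assuming id uniquely identifies rows and the table resolves under the default dbo schema:

# Push the imputed rows back in one call, matching on the id column
sqlUpdate(channel, missingdata, tablename = "autodata", index = "id")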
Note that this in-database approach (SQL Server R Services) is new with SQL Server 2016.
Advantages: data do not have to move; performance improvement (scale, parallelism). Challenges: harder to code; harder to set up access.
There are lots of use cases that fit into several categories: Association Analysis (Market Basket Analysis), Classification, Estimation, Simulation and Optimization, Clustering, and more.
Products often sell well together. Some of these patterns are well established and may only be confirmed by the analysis. More unexpected patterns, like the apocryphal beer and diapers example, might be discovered too, providing additional insight.
Explore associations; confirm expected patterns; find unexpected patterns; create actionable insights.
Set up periodic monitoring of known rules; detect drops in association strength and investigate.
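To make this concrete, here is a minimal sketch using the arules package (not part of the original demo); the basket file name and the support and confidence thresholds are assumptions:

library(arules)
# Read point-of-sale baskets (assumed file with one comma-separated basket per line)
trans <- read.transactions("baskets.csv", format = "basket", sep = ",")
# Mine association rules above minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
# Inspect the strongest rules by lift; monitor these over time for drops in strength
inspect(head(sort(rules, by = "lift"), 5))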
A charter fishing company wishes to determine the optimal number of boats to have in service. Too many boats will mean wasted resources, while too few boats will mean missed opportunities.
Use historical and forecast data to fit a distribution
Use the fitted distribution to project revenue for each additional boat. Decide how many boats to keep!
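A minimal R sketch of this idea, with made-up demand history, an assumed Poisson demand model, and an assumed charter price:

library(MASS)   # for fitdistr()
# Hypothetical daily charter-demand counts from past seasons
demand <- c(3, 5, 4, 6, 2, 5, 7, 4, 3, 6)
lambda <- fitdistr(demand, "Poisson")$estimate["lambda"]
# Expected trips sold per day with n boats in service: E[min(demand, n)]
expected_trips <- function(n) sum(pmin(0:50, n) * dpois(0:50, lambda))
revenue_per_trip <- 1500          # assumed price of one charter
# Expected daily revenue added by each additional boat, for 1 through 10 boats
diff(sapply(0:10, expected_trips)) * revenue_per_trip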
We would like to group countries that are economically similar to one another.
We begin with data on each country: Median GDP Growth (3 years), Population (in millions), and the Enabling Trade Index.
setwd("c:/users/michael/desktop/demos")
dfrm <- read.csv(file = "clustering-demo-data.csv", header = TRUE, stringsAsFactors = FALSE)

# Standardize each input so no single variable dominates the distance measure
dfrm$scgdpg <- scale(dfrm$medgdpg, center = TRUE, scale = TRUE)
dfrm$scpop  <- scale(dfrm$pop13,   center = TRUE, scale = TRUE)
dfrm$sceti  <- scale(dfrm$eti,     center = TRUE, scale = TRUE)

# k-means on the three scaled columns: 5 clusters, 10 random starts
kmc <- kmeans(dfrm[, 5:7], centers = 5, nstart = 10)
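A small follow-up, assuming the kmc object fitted above, to inspect the assignments behind the cluster counts listed next:

dfrm$cluster <- kmc$cluster       # attach each country's cluster label
table(dfrm$cluster)               # cluster sizes
kmc$centers                       # scaled centroids, useful for interpreting each cluster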
Cluster 1: 50 Countries
Cluster 2: 42 Countries
Cluster 3: 11 Countries
Cluster 4: 2 Countries
Cluster 5: 33 Countries
What sale price should I use for Froot Loops?
Use historical data to determine lift for each price point.
Use lift to determine relative profit for each price point. Recommend a sale price to your marketing and sales teams!
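A small sketch of the arithmetic, using made-up baseline and promotional sales figures and an assumed unit cost:

# Hypothetical history: units sold at the regular price vs. during each promotion
pricing <- data.frame(
  sale_price     = c(2.99, 2.79, 2.49, 1.99),
  baseline_units = c(100, 100, 100, 100),
  promo_units    = c(115, 135, 180, 260)
)
unit_cost <- 1.50                                    # assumed cost per box
pricing$lift   <- pricing$promo_units / pricing$baseline_units
pricing$profit <- (pricing$sale_price - unit_cost) * pricing$promo_units
pricing[which.max(pricing$profit), ]                 # candidate sale price to recommend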
Two Broad Areas of Concern: Jobs and Ethics