#Azure #MicrosoftAIJourney

1 #Azure #MicrosoftAIJourney

4 SQL Server + R

6 1990s: Development started, based on the S language (created in 1980)
1993: R starts as a research project at the University of Auckland, New Zealand
First version
Alpha
Stable Beta
V1.0: Considered by developers to be production ready
V2.0
First UseR conference
V3.0

8 CRAN: Comprehensive R Archive Network

10 Open Source: the lingua franca of analytics, computing, and modeling
CRAN Task View by Barry Rowlingson
More packages on GitHub and in the Bioconductor project

11 Boxplot, Bar Plot, Histogram, Contour, Dot Plot, Mosaic, Scatter, Latticist

12 Vectors

14 Investment insurance
R is an abstract language, so code is safe across platforms and versions, unlike other big data tools such as Spark
Runs on existing in-production platforms: SQL Server 2016, Spark clusters, Teradata
Approachable language: no need for a computer science degree; free version for learning
Stable deployment
Familiar tooling

15 Open Source R compared to Microsoft R Server
US flight data for 20 years
Linear regression on arrival delay
Run on a 4-core laptop with 16 GB RAM and a 500 GB SSD

19 Application Database

20 SQL Server Machine Learning Services
Better collaboration and sharing of insights
Faster time to insight
Streamlined productivity and deployment
Better security and compliance

21 [Architecture diagram] sqlservr.exe (MSSQLSERVER service) receives sp_execute_external_script and tells launchpad.exe (MSSQLLAUNCHPAD service) over a named pipe what to launch and how. The R/Python launcher starts the R/Python satellite processes inside a Windows job object. The satellite processes exchange data with SQL Server over TCP through sqlsatellite.dll.

22 Pushing compute to the data

23 SQL Server

24 train <- sqlQuery(connection, "select * from nyctaxi_sample")
model <- glm(formula, train)
Data scientist workstation (any R/Python IDE): 1. Pull data from the DB, 2. Execution, 3. Model output

25 cc <- RxInSqlServer(connectionString, computeContext)
rxLogit(formula, cc)
Data scientist workstation (any R/Python IDE) and SQL Server 2017 Machine Learning Services (R/Python runtime): 1. Script, 2. Execution, 3. rx* output, 4. Model or predictions
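The contrast between slides 24 and 25 can be sketched in a few lines. This is only an illustration in Python, using an in-memory SQLite database as a stand-in for SQL Server; the table `trips`, its columns, and the helper names are hypothetical. The first function pulls every row to the client before computing, the second sends the computation to the database and receives a single row back.

```python
import sqlite3


def make_demo_db(n_rows=1000):
    """Build an in-memory table standing in for a large server-side table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, fare REAL)")
    conn.executemany(
        "INSERT INTO trips (fare) VALUES (?)",
        [(float(i % 50),) for i in range(n_rows)],
    )
    conn.commit()
    return conn


def mean_fare_pull(conn):
    """Pull approach: every row is transferred, then the client computes the mean."""
    rows = conn.execute("SELECT fare FROM trips").fetchall()
    return sum(r[0] for r in rows) / len(rows), len(rows)


def mean_fare_push(conn):
    """Push approach: the database computes the mean and returns one row."""
    rows = conn.execute("SELECT AVG(fare) FROM trips").fetchall()
    return rows[0][0], len(rows)


conn = make_demo_db()
pull_mean, pull_rows_moved = mean_fare_pull(conn)
push_mean, push_rows_moved = mean_fare_push(conn)
print("rows moved (pull vs push):", pull_rows_moved, push_rows_moved)
```

Both calls give the same answer; what changes is how much data leaves the database, which is the point of the in-database compute context.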

27 # Set ComputeContext
cc <- RxInSqlServer(connectionString = connection_string, numTasks = num_tasks)
rxSetComputeContext(cc)
# Define data source (further columns are elided on the slide)
visitor_interests <- RxSqlServerData(sqlQuery = input_query, colClasses = c(book_category = "numeric", college_education = "numeric", male = "numeric", clicks_in_1 = "numeric", ...), connectionString = connection_string, useFastRead = TRUE)
# Train model on SQL Server, i.e. push the rxLogit compute to the remote server
logit_model <- rxLogit(book_category ~ college_education + male + clicks_in_1 + ..., data = visitor_interests)

31 SQL Server

33 CREATE TABLE iris_rx_data ("Sepal.Length" float not null, "Sepal.Width" float not null, "Petal.Length" float not null, "Petal.Width" float not null, "Species" varchar(100))
INSERT INTO iris_rx_data
EXEC sp_execute_external_script @language = N'R', @script = N'iris_data <- iris;', @input_data_1 = N'', @output_data_1_name = N'iris_data'
--WITH RESULT SETS (("Sepal.Length" float not null,
--"Sepal.Width" float not null,
--"Petal.Length" float not null,
--"Petal.Width" float not null, "Species" varchar(100)));
ALTER TABLE iris_rx_data ADD ID INT PRIMARY KEY NOT NULL IDENTITY (1,1)

34 BEGIN TRY
CREATE TABLE [dbo].[iris_rx_models] ([model_name] [varchar](30) NOT NULL, [model] [varbinary](max) NOT NULL)
END TRY
BEGIN CATCH
print ERROR_MESSAGE()
END CATCH
GO
CREATE PROCEDURE [dbo].[generate_iris_rxbtrees_model] AS
BEGIN
DELETE FROM [dbo].[iris_rx_models] WHERE model_name = 'iris_rxbtrees_model'
DECLARE @model varbinary(max);
EXECUTE sp_execute_external_script @language = N'R', @script = N'
iris.sub <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
iris.dtree <- rxDTree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris[iris.sub, ])
# realtimeScoringOnly: setting this flag could reduce the model size, but rxUnserializeModel can then no longer retrieve the RevoScaleR model
model <- rxSerializeModel(iris.dtree, realtimeScoringOnly = FALSE)
rxUnserializeModel(model)
cat(paste0("r Process ID = ", Sys.getpid()))
cat("\n")
', @params = N'@model varbinary(max) OUTPUT', @model = @model OUTPUT
INSERT [dbo].[iris_rx_models] (model_name, model) VALUES ('iris_rxbtrees_model', @model);
END;

35 ALTER PROCEDURE [dbo].[predict_species] (@model varchar(100)) AS
BEGIN
DECLARE @nb_model varbinary(max) = (select model from iris_rx_models where model_name = @model);
-- Predict species based on the specified model:
exec sp_execute_external_script @language = N'R', @script = N'
require("RevoScaleR");
irismodel <- rxUnserializeModel(nb_model)
species <- rxPredict(irismodel, iris_rx_data[,2:5]);
OutputDataSet <- cbind(iris_rx_data[1], species, iris_rx_data[6]);
cat(paste0("r Process ID = ", Sys.getpid()))
cat("\n")
OutputDataSet <- merge(iris_rx_data, OutputDataSet)
colnames(OutputDataSet) <- c("id", "1","2","3","4", "Species.Actual", "6","7", "Species.Expected");
OutputDataSet <- OutputDataSet;
', @input_data_1 = N'select id, "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" from iris_rx_data', @input_data_1_name = N'iris_rx_data', @params = N'@nb_model varbinary(max)', @nb_model = @nb_model
with result sets UNDEFINED;
END;
GO
EXEC [predict_species] 'iris_rxbtrees_model'

36 Dataset = Rows = SQL Server 5000

37 ALTER procedure [dbo].[predict_species_stream] varchar(100)) as begin varbinary(max) = (select model from [dbo].[iris_rx_models] where model_name -- Predict species based on the specified model: exec = = N' require("revoscaler"); irismodel<-rxunserializemodel(nb_model) species<-rxpredict(irismodel, iris_rx_data[,2:5]); OutputDataSet <- cbind(iris_rx_data[1], species, iris_rx_data[6]); colnames(outputdataset) <- c("id", "Species.Actual", "Species.Expected"); cat(paste0("r Process ID = ", Sys.getpid())) cat("\n") OutputDataSet <- OutputDataSet; = N' select id, "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" from = = N'@nb_model int = with result sets UNDEFINED; end; GO EXEC [predict_species_stream] 'iris_rxbtrees_model'

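38 sp_execute_external_script options: @r_rowsPerRead = N for streaming; @parallel = 1 together with OPTION (MAXDOP = 2) on the input query for parallel execution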

40 --Create Fake Data
SELECT TOP 0 * INTO iris_rx_data_big FROM iris_rx_data
GO
INSERT INTO iris_rx_data_big ([Sepal.Length], [Sepal.Width], [Petal.Length], [Petal.Width])
VALUES (
RIGHT(ABS(CHECKSUM(NEWID())), 2)/10.0,
RIGHT(ABS(CHECKSUM(NEWID())), 2)/10.0,
RIGHT(ABS(CHECKSUM(NEWID())), 2)/10.0,
RIGHT(ABS(CHECKSUM(NEWID())), 2)/10.0 )
GO 1000
INSERT INTO iris_rx_data_big ([Sepal.Length], [Sepal.Width], [Petal.Length], [Petal.Width])
SELECT
RIGHT(ABS(CHECKSUM(NEWID())), 2)/10.0,
RIGHT(ABS(CHECKSUM(NEWID())), 2)/10.0,
RIGHT(ABS(CHECKSUM(NEWID())), 2)/10.0,
RIGHT(ABS(CHECKSUM(NEWID())), 2)/10.0
FROM iris_rx_data_big
GO 10
CREATE INDEX [IXiris_rx_data_big] ON iris_rx_data_big(id)
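The size of the generated table follows from simple arithmetic: `GO 1000` runs the single-row insert 1,000 times, and each of the 10 runs of the INSERT ... SELECT adds one copy of every existing row, so the table ends up with 1,000 × 2^10 = 1,024,000 rows. The Python check below is a rough sketch; `fake_value` and `rows_after` are hypothetical helpers, and `random` merely stands in for `NEWID()`.

```python
import random


def fake_value(rng):
    """Approximate RIGHT(ABS(CHECKSUM(NEWID())), 2)/10.0: last two digits of a random integer, divided by 10."""
    n = abs(rng.randint(-2**31 + 1, 2**31 - 1))
    return int(str(n)[-2:]) / 10.0


def rows_after(seed_rows, doublings):
    """Row count after repeatedly inserting the table into itself."""
    rows = seed_rows
    for _ in range(doublings):
        rows += rows  # INSERT ... SELECT from the same table doubles it
    return rows


total_rows = rows_after(1000, 10)
rng = random.Random(0)
samples = [fake_value(rng) for _ in range(1000)]
print(total_rows, min(samples), max(samples))
```

Every generated measurement falls between 0.0 and 9.9, which is enough for a load test but not a realistic iris distribution.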

41 ALTER PROCEDURE [dbo].[predict_species_parallel] varchar(100)) as begin varbinary(max) = (select model from iris_rx_models where model_name -- Predict species based on the specified model: exec = = N' require("revoscaler"); irismodel<-rxunserializemodel(nb_model) species<-rxpredict(irismodel, iris_rx_data[,2:5]); OutputDataSet <- cbind(iris_rx_data[1], species, iris_rx_data[6]); colnames(outputdataset) <- c("id", "Species.Actual", "Species.Expected"); cat(paste0("r Process ID = ", Sys.getpid())) cat("\n") cat(" ") cat("\n") OutputDataSet <- OutputDataSet; = = N'select id, "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" from iris_rx_data_big WHERE LEFT(ID,1) BETWEEN 1 AND 5 /*Add this bit to make parallel*/ OPTION(MAXDOP = = N'@nb_model int = with result sets (("ID" INT, "Setosa" INT, "versicolor" INT, "virginica" INT, "Species" varchar(150))); END; GO SET STATISTICS XML ON EXEC [predict_species_parallel] 'iris_rxbtrees_model' SET STATISTICS XML OFF

43 DECLARE @model varbinary(max) = (SELECT TOP 1 model from [dbo].[iris_rx_models] WHERE model_name = 'iris_rxbtrees_model');
DROP TABLE IF EXISTS #TMP
SELECT [Sepal.Length] * .9 AS [Sepal.Length], [Sepal.Width] * .9 AS [Sepal.Width], [Petal.Length] * .9 AS [Petal.Length], [Petal.Width] * .9 AS [Petal.Width], Species, ID
INTO #tmp
FROM iris_rx_data
SELECT * FROM PREDICT(MODEL = @model, DATA = #tmp AS d)
WITH (setosa_pred float, versicolor_pred float, virginica_pred float) AS p;

44 DECLARE @model varbinary(max) = 0x626C6F62298B1834AEF8DA6B26B89D32475A371DEDE9328FD60DB36F45A5E2C702CB... /* remaining bytes of the serialized model are not recoverable */
DROP TABLE IF EXISTS #TMP
SELECT * INTO #TMP FROM (
SELECT 5.1 [Sepal.Length], 3.4 [Sepal.Width], 1.5 [Petal.Length], 0.3 [Petal.Width]
UNION SELECT 4.3, 3.1, 1.2, 0.2
UNION SELECT 6.6, 1.4, 5.3, 2.2) AS A
SELECT d.*, p.* FROM PREDICT(MODEL = @model, DATA = #TMP as d)
WITH (setosa_Pred float, versicolor_pred float, virginica_pred float) as p;

47 DMV: Description
sys.dm_exec_requests: New column external_script_request_id
sys.dm_external_script_requests: Returns running external scripts, DOP and assigned user account
sys.dm_external_script_execution_stats: Number of executions for rx* functions in the RevoScaleR package
sys.dm_os_performance_counters: New External Scripts performance counters

48 SELECT * FROM sys.resource_governor_external_resource_pools
GO
ALTER EXTERNAL RESOURCE POOL [default] WITH (
MAX_CPU_PERCENT = 90,
AFFINITY CPU = AUTO,
MAX_MEMORY_PERCENT = 25 );
GO
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO
SELECT * FROM sys.resource_governor_external_resource_pools

49 (SQL Server 2017) 2017

51 CD C:\Program Files\Microsoft SQL Server\140\Setup Bootstrap\SQL2017\x64\
RSetup.exe /install /component MLM /version /language 1033 /destdir "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES\library\MicrosoftML\mxLibs\x64"

52 CREATE TABLE CNNFileLocations ([file.name] nvarchar(max), type nvarchar(max), label Int, modelname nvarchar(150))
GO
CREATE PROCEDURE [dbo].[spcnnloadfilelocationsr] (@ModeltoTrain NVarChar(MAX)) as
begin
DELETE FROM CNNFileLocations WHERE modelname = @ModeltoTrain
INSERT INTO CNNFileLocations
execute sp_execute_external_script @language = N'R', @script = N'
root.directory.name.training <- paste("c:/r/images/", ModeltoTrain, "/Training", sep="")
root.directory.name.testing <- paste("c:/r/images/", ModeltoTrain, "/testing", sep="")
training.folders <- list.dirs(root.directory.name.training)
root.folder.length <- nchar(root.directory.name.training) + 1
# Remove the root folder as we do not need it
training.folders <- training.folders[-1]
imagesdf <- data.frame(cbind(
file.name = file.path(list.files(training.folders, "*.*", full.names = TRUE)),
type = substr(dirname(list.files(training.folders, "*.*", full.names = TRUE)), root.folder.length + 1, 1000)),
stringsAsFactors = FALSE)
# Create an integer label by turning the type into a factor and then into an integer
imagesdf$label <- as.integer(as.factor(imagesdf[[2]])) - 1
imagesdf$modelname <- ModeltoTrain
OutputDataSet <- data.frame(imagesdf)
', @params = N'@ModeltoTrain nvarchar(max)', @ModeltoTrain = @ModeltoTrain;
END
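The labeling step in the R script above, `as.integer(as.factor(...)) - 1`, maps each image folder name to a zero-based integer class label. A rough Python equivalent is sketched below, assuming the factor levels are simply the sorted unique names (R orders levels in a locale-dependent way, so plain `sorted` is an approximation). The folder names and the helper `zero_based_labels` are hypothetical.

```python
def zero_based_labels(types):
    """Map each category name to its index among the sorted unique names."""
    levels = sorted(set(types))
    index = {name: i for i, name in enumerate(levels)}
    return [index[t] for t in types], levels


types = ["cat", "dog", "cat", "bird"]
labels, levels = zero_based_labels(types)
print(levels, labels)
```

Keeping `levels` alongside the integer labels matters later: the prediction procedure needs the same mapping to turn a predicted label back into a type name.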

53 CREATE PROCEDURE [dbo].[spcnnmodelcreate] NVarChar(MAX)) as begin nvarchar(max) = CONCAT(N'SELECT [file.name], [type], [Label] FROM CNNFileLocations WHERE modelname = N'''') CREATE TABLE [dbo].[cnnmodel]( execute sp_execute_external_script [int] = N'R' IDENTITY(1,1) NOT NULL PRIMARY = N' [Model] [varbinary](max) NULL, require(microsoftml) [ModelName] [nvarchar](150) NULL, [dt2] [datetime2](7) NOT NULL DEFAULT(GETDATE())) imagesdf <- CNNFileLocations imagesdf$file.name <- as.character(imagesdf$file.name) imagesdf$type GO <- as.character(imagesdf$type) imagesdf$label <- as.numeric(imagesdf$label) imagemodel <- rxlogisticregression( formula = Label ~ Features, data = imagesdf, NVarChar(MAX) = 'GoT') AS type = "multiclass", mltransforms = list( loadimage(vars = list(features = "file.name")), resizeimage(vars = "Features", width = 224, height = 224), extractpixels(vars = "Features"), INSERT featurizeimage(var = "Features", dnnmodel = "Resnet50")) ) CREATE PROCEDURE [dbo].[spcnnmodelinsert] (@ModeltoTrain AS TABLE (v VarBinary(MAX)) EXEC OutputDataSet <- data.frame(payload = as.raw(serialize(imagemodel, connection=null))); INSERT INTO CNNModel (Model, ModelName) SELECT = = = N'@ModeltoTrain with result sets ((model varbinary(max))); END

54 CREATE PROCEDURE [dbo].[spcnnmodelpredict] varchar(150)) AS BEGIN nvarchar(max) = CONCAT(N'SELECT [type], label FROM CNNFileLocations WHERE modelname = N''' GROUP BY [type], label ORDER BY [type]') varbinary(max); select TOP = model from CNNModel where ModelName ORDER BY dt2 DESC -- Predict species based on the specified model: exec = = N' cnn_modelu <- unserialize(cnn_model) root.directory.name.testing <- paste("c:/r/images/", ModeltoTrain, "/testing", sep="") testing.folder <- list.dirs(root.directory.name.testing) test.files <- data.frame(file.name = file.path(list.files(testing.folder, "*.*", full.names = TRUE)), stringsasfactors = FALSE) test.files[, "Label"] <- -99 # Lets use the trained model to predict the type of image prediction <- rxpredict(cnn_modelu, data = test.files, extravarstowrite = list("label", "file.name")) #Get the distinct values distinct.types <- CNNFileLocations OutputDataSet <- distinct.types #Join to find the type names prediction <- merge(prediction, distinct.types, by.x = "PredictedLabel", by.y = "label") OutputDataSet <- prediction = = = N'@cnn_model @ModeltoTrain with result sets UNDEFINED; end;

56 R Client
Easily scale up from a single server to a grid to handle more concurrent requests
Load balancing across compute nodes
A shared pool of warmed-up R shells to improve scoring performance

57 Load Balancer
Server-level HA: introduce multiple web nodes for active-active backup / recovery, via a load balancer
Data store HA: leverage the HA capabilities of an enterprise-grade DB (SQL Server and Postgres)

60 Distributed R: how does a local compute context work?
[Diagram] Microsoft R Server client (R IDE or command line, console), LOCAL CONTEXT: the predictive algorithm loads the big data one block at a time and analyzes blocks in parallel.
Microsoft R Server functions: a compute context defines where to process, e.g. a remote context like Hadoop MapReduce. Microsoft R functions are prefixed with rx. The currently set compute context determines the processing location.
Copyright Microsoft Corporation. All rights reserved.

61 Distributed R: how does a remote compute context work?
[Diagram] Microsoft R Server client (R IDE or command line, console) and Microsoft R Server on the server side, REMOTE CONTEXT: the client packs and ships requests to the remote environment; the algorithm master distributes the work and compiles the results; blocks are loaded one at a time from the big data and analyzed in parallel; the predictive algorithm results return to the console.
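The "load a block at a time, analyze blocks in parallel, compile results" pattern on these two slides can be illustrated with a small sketch. This is not how RevoScaleR is implemented; it is a minimal Python illustration, with hypothetical helper names, in which each block yields partial sums that a master step combines into a mean and a sample variance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def blocks(values, block_size):
    """Yield the data one block at a time so the full set is never held at once."""
    it = iter(values)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


def block_stats(block):
    """Per-block sufficient statistics: count, sum, and sum of squares."""
    return len(block), sum(block), sum(x * x for x in block)


def chunked_mean_var(values, block_size=1000, workers=4):
    """Analyze blocks in parallel, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(block_stats, blocks(values, block_size)))
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance


mean, variance = chunked_mean_var(range(1, 10001), block_size=1000)
print(mean, variance)
```

Because only the small per-block summaries are combined, the same master step works whether the blocks are processed locally or on remote workers.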

62 ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for MapReduce. The compute context set in the R script determines where the model will run.
Local parallel processing (Linux or Windows):
### SETUP LOCAL ENVIRONMENT VARIABLES ###
mylocalcc <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(mylocalcc)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
linuxfs <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall/AirlineDemoSmall.xdf", fileSystem = linuxfs)
In Hadoop:
myhadoopcc <- RxHadoopMR()
rxSetComputeContext(myhadoopcc)
hdfsfs <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall/AirlineDemoSmall.xdf", fileSystem = hdfsfs)
The functional model R script does not need to change to run in Hadoop:
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear Model and plot
hdfsxdfarrlatelinmod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsxdfarrlatelinmod$coefficients)

63 ScaleR models can be deployed from a server or edge node to run in SQL Server without any functional R model re-coding for in-database computations. The compute context set in the R script determines where the model will run.
Local parallel processing (Linux or Windows):
### SETUP LOCAL ENVIRONMENT VARIABLES ###
mysqlcon <- "Driver=SQL;SERVER=localhost;Database=RevoTester;Uid=RevoTester;pwd=######"
mylocalcc <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(mylocalcc)
### CREATE SQL SERVER DATA SOURCE ###
AirlineDemoQuery <- "SELECT * FROM AirlineDemoSmall;"
AirlineDataSet <- RxOdbcData(connectionString = mysqlcon, sqlQuery = AirlineDemoQuery)
In SQL Server:
### SETUP SQL Server ENVIRONMENT VARIABLES ###
mysqlcc <- "Driver=SQL;SERVER=localhost;Database=RevoTester;Uid=RevoTester;pwd=######"
### SQL SERVER COMPUTE CONTEXT ###
# rxSetComputeContext expects a compute context object, so the connection string is wrapped in RxInSqlServer
rxSetComputeContext(RxInSqlServer(connectionString = mysqlcc))
### CREATE SQL SERVER DATA SOURCE ###
AirlineDemoQuery <- "SELECT * FROM AirlineDemoSmall;"
AirlineDataSet <- RxSqlServerData(connectionString = mysqlcc, sqlQuery = AirlineDemoQuery)
The functional model R script does not need to change to run in either DB:
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear Model and plot
hdfsxdfarrlatelinmod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsxdfarrlatelinmod$coefficients)

64 ScaleR models can be deployed from a server or edge node to run in Teradata without any functional R model re-coding for in-database computations. The compute context set in the R script determines where the model will run.
Local parallel processing (Linux or Windows):
### SETUP LOCAL ENVIRONMENT VARIABLES ###
mylocalcc <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(mylocalcc)
### CREATE LOCAL FILE-SYSTEM POINTER AND FILE OBJECT ###
localfs <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = localfs)
In Teradata:
### SETUP TERADATA ENVIRONMENT VARIABLES ###
mytdcc <- "Driver=Teradata;DBCNAME=TeradataProd;Database=RevoTester;Uid=RevoTester;pwd=######"
### TERADATA COMPUTE CONTEXT ###
# rxSetComputeContext expects a compute context object, so the connection string is wrapped in RxInTeradata
rxSetComputeContext(RxInTeradata(connectionString = mytdcc))
### CREATE TERADATA DATA SOURCE ###
AirlineDemoQuery <- "SELECT * FROM AirlineDemoSmall;"
AirlineDataSet <- RxTeradata(connectionString = mytdcc, sqlQuery = AirlineDemoQuery)
The functional model R script does not need to change to run in Teradata:
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear Model and plot
hdfsxdfarrlatelinmod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsxdfarrlatelinmod$coefficients)

70 Anomaly Detection

77 Train on what is normal (single class). The model learns what it is like to be normal. When it encounters an item that does not fit its idea of normal, that item is counted as an anomaly.
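A minimal sketch of the single-class idea follows. The slide does not name an algorithm, so this Python example swaps in a plain z-score rule as a stand-in: the detector is fitted on normal readings only and flags anything that lies too far from them. The class name, threshold, and sample values are hypothetical.

```python
import statistics


class NormalOnlyDetector:
    """Learns the mean and spread of normal data; flags values far outside it."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.mean = None
        self.stdev = None

    def fit(self, normal_values):
        # Training sees only examples of the normal class.
        self.mean = statistics.mean(normal_values)
        self.stdev = statistics.stdev(normal_values)
        return self

    def is_anomaly(self, value):
        # Distance from normal, measured in standard deviations.
        return abs(value - self.mean) / self.stdev > self.threshold


normal_readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
detector = NormalOnlyDetector().fit(normal_readings)
flags = [detector.is_anomaly(v) for v in (10.05, 14.0)]
print(flags)
```

No anomalous examples are needed at training time, which is what makes the approach practical when anomalies are rare or unknown in advance.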

81 BIG DATA & ADVANCED ANALYTICS AT A GLANCE
Stages: Ingest, Store, Prep & Train, Model & Serve, Intelligence
Orchestration: Data Factory (data movement, pipelines & orchestration)
Sources: Business apps, Custom apps, Sensors and devices
Services: Kafka, Event Hub, IoT Hub, Blobs, Data Lake, Databricks, HDInsight, Data Lake Analytics, Machine Learning, Cosmos DB, SQL Database, SQL Data Warehouse, Analysis Services
Outputs: Predictive apps, Operational reports, Analytical dashboards

82 APACHE SPARK
A unified, open source, parallel data processing framework for big data analytics
Spark SQL: interactive queries
Spark MLlib: machine learning
Spark Streaming and Spark Structured Streaming: stream processing
GraphX: graph computation
Spark Core Engine
Cluster managers: YARN, Mesos, Standalone Spark scheduler

83 SPARK - BENEFITS
Performance: Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests). Can be used for batch and real-time data processing.
Developer Productivity: Easy-to-use APIs for processing large datasets. Includes 100+ operators for transforming data.
Unified Engine: The integrated framework includes higher-level libraries for interactive SQL queries, stream analytics, ML and graph processing. A single application can combine all types of processing.
Ecosystem: Spark has built-in support for many data sources, a rich ecosystem of ISV applications and a large developer community. Available on multiple public clouds (AWS, Google and Azure) and from multiple on-premises distributors.

84 ADVANTAGES OF A UNIFIED PLATFORM
Spark Streaming, Spark Machine Learning, Spark SQL

85 DATABRICKS - COMPANY OVERVIEW

86 AZURE DATABRICKS
Microsoft Azure

87 AZURE DATABRICKS
Azure Databricks Collaborative Workspace: data engineer, data scientist, business analyst
Deploy Production Jobs & Workflows: multi-stage pipelines, job scheduler, notification & logs
Optimized Databricks Runtime Engine: Databricks I/O, Apache Spark, serverless
Connected data and services: IoT / streaming data, cloud storage, data warehouses, Hadoop storage, machine learning models, BI tools, data exports, REST APIs
Enhance productivity. Build on a secure & trusted cloud. Scale without limits.

88 KNOWING THE VARIOUS BIG DATA SOLUTIONS
Spectrum from CONTROL to EASE OF USE (reduced administration)
BIG DATA ANALYTICS
IaaS clusters: Azure Marketplace (HDP, CDH, MapR). Any Hadoop technology, any distribution.
Managed clusters: Azure HDInsight. Workload optimized, managed clusters.
Managed clusters: Azure Databricks. Frictionless & optimized Spark clusters.
Big data as-a-service: Azure Data Lake Analytics. Data engineering in a job-as-a-service model.
BIG DATA STORAGE: Azure Data Lake Store, Azure Storage

89 GENERAL SPARK CLUSTER ARCHITECTURE
Driver Program (SparkContext)
Cluster Manager
Worker Node, Worker Node, Worker Node
Data Sources (HDFS, SQL, NoSQL, ...)
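The division of labor in this architecture (a driver that partitions the work, workers that process partitions, and a merge of the partial results) can be imitated on one machine. The sketch below is plain Python with a thread pool standing in for worker nodes; it is not Spark code, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Task run on a worker: count the words in one partition of lines."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def driver(lines, num_workers=3):
    """Driver: partition the data, hand tasks to workers, merge partial results."""
    partitions = [lines[i::num_workers] for i in range(num_workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        for partial in workers.map(count_words, partitions):
            total.update(partial)
    return total


lines = ["spark sql", "spark streaming", "spark mllib", "graphx"]
word_counts = driver(lines)
print(word_counts.most_common(1))
```

In a real cluster the cluster manager decides which worker node receives each task; here the thread pool plays that role.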

90 SECURE COLLABORATION
Azure Databricks enables secure collaboration between colleagues. With Azure Databricks, colleagues can securely share key artifacts such as clusters, notebooks, jobs and workspaces.
Secure collaboration is enabled through a combination of:
Fine-grained permissions: define who can do what on which artifacts (access control)
AAD-based user authentication: ensures that users are actually who they claim to be
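The combination of authentication ("who are you?") and fine-grained permissions ("what may you do on which artifact?") can be illustrated with a toy check. The Python below is a sketch under stated assumptions: the artifact names, users, and actions are hypothetical and do not reflect the actual Azure Databricks permission model or API.

```python
# Hypothetical access-control list: (artifact, user) -> allowed actions.
PERMISSIONS = {
    ("notebook:churn-model", "alice"): {"read", "run", "edit"},
    ("notebook:churn-model", "bob"): {"read"},
    ("cluster:shared", "bob"): {"attach"},
}

# Stand-in for the identity provider: users whose identity has been verified.
AUTHENTICATED_USERS = {"alice", "bob"}


def is_allowed(user, action, artifact):
    """Authentication first, then the per-artifact permission check."""
    if user not in AUTHENTICATED_USERS:
        return False
    return action in PERMISSIONS.get((artifact, user), set())


checks = [
    is_allowed("alice", "edit", "notebook:churn-model"),
    is_allowed("bob", "edit", "notebook:churn-model"),
    is_allowed("mallory", "read", "notebook:churn-model"),
]
print(checks)
```

The second check fails on permissions and the third on authentication, which shows why the slide lists the two mechanisms separately.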

92 JOBS
Jobs are the mechanism for submitting Spark application code for execution on Azure Databricks clusters.
Jobs execute either notebooks or JARs.
Azure Databricks provides a comprehensive set of graphical tools to create, manage and monitor jobs.

93 DATABRICKS SPARK IS FAST
Benchmarks have shown Databricks to often have better performance than alternatives.
SOURCE: Benchmarking Big Data SQL Platforms in the Cloud

94 SPARK ML ALGORITHMS

95 DEEP LEARNING
Azure Databricks supports and integrates with a number of deep learning libraries and frameworks to make it easy to build and deploy deep learning applications.
Supported deep learning libraries and frameworks include:
Microsoft Cognitive Toolkit (CNTK); an article explains how to install CNTK on Azure Databricks
TensorFlowOnSpark
BigDL
Offers Spark Deep Learning Pipelines, a suite of tools for working with and processing images using deep learning and transfer learning. It includes high-level APIs for common aspects of deep learning so they can be done efficiently in a few lines of code:
Distributed hyperparameter tuning
Transfer learning

105 Visual Studio Tools for AI
Visual Studio extension with deep integration with Azure ML
End-to-end development environment, from new project through training
Support for remote training
Job management
On top of all of the goodness of Visual Studio (Python, Jupyter, Git, etc.)

109 THE FASTEST TOOLKIT

110 MOST SCALABLE

111 This material is provided for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED.
