Data and AI LATAM 2018
Streamline Productivity and Simplify Deployment. Separate service or embedded logic. [Diagram: applications scoring against a model on a separate analytics server (data transformations, model training) versus SQL Server hosting data transformations, model training, and scoring alongside the model]
IEEE Spectrum Top Programming Languages IEEE Spectrum, July 2017 KDnuggets Top Data Science Tools, 2017
SQL Server 2017
http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
Eliminate data movement Operationalize scripts and models Enterprise grade performance and scale Extensibility
Demo Hello World
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'print(paste("Hello World from:", Revo.version$version.string));';

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import sys
print("Hello World from:", sys.version)
';
Using SQL Server 2017 Machine Learning Services
Augments R & Python with parallelized, distributed algorithms Provides in-database execution of scripts and algorithms used Parallel algorithms overcome Python and R memory limitations Reduces security risks by keeping data in-database Creates and consumes portable models
One call, one answer. Arbitrarily large data sets. Arbitrarily large worker task set. Mathematically the same as single-threaded. Platform independent. Most are written in C++ for speed.
1. Algorithm begins initiator process
2. Initiator distributes work to nodes
3. Finalizer collects results
4. Finalizer iterates or continues
5. Finalizer evaluates final model
6. Returns single model to calling script
RevoScaleR & RevoScalePy Algorithms and Functions. [Diagram: load a large dataset (data larger than RAM), then run a RevoScaleR or RevoScalePy algorithm]
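The initiator, worker, and finalizer steps above can be illustrated with a minimal Python sketch. It is not RevoScaleR or revoscalepy code: the function names are hypothetical, and a thread pool stands in for the worker nodes. It only shows why a chunked computation can return the same answer as a single-threaded one, here for a mean and a population variance built from per-chunk sums.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean, pvariance


def chunked(values, size):
    """Yield successive slices of `values` holding at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def partial_stats(chunk):
    """Worker task: reduce one chunk to (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def finalize(partials):
    """Finalizer: merge the per-chunk statistics into one mean and variance."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance


def parallel_mean_variance(values, chunk_size=1000, workers=4):
    """Initiator: hand chunks to workers, then pass their results to the finalizer."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunked(values, chunk_size)))
    return finalize(partials)


data = [float(i % 97) for i in range(10_000)]
mean, variance = parallel_mean_variance(data)

# Compare against the single-pass answers from the standard library.
print(abs(mean - fmean(data)) < 1e-9)
print(abs(variance - pvariance(data)) < 1e-6)
```

Only the small per-chunk tuples travel back to the finalizer, which is what lets this style of algorithm handle data larger than RAM.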
One call; one or many models returned. Arbitrarily large data sets. Arbitrarily large worker task set. Augments RevoScaleR. Fast learners. Deep learning algorithms. Ensemble results using rxEnsemble. RevoScaleR & RevoScalePy Algorithms and Functions. [Diagram: load a large dataset (data larger than RAM), then run a RevoScaleR or RevoScalePy algorithm]
Executes RevoScaleR algorithms on remote data & CPUs. rxSetComputeContext redirects to remote. Algorithms in the RevoScaleR library redirect as set. Results are returned to the script as though local.
1. Algorithm on local checks compute context
2. If set remote, packages and ships request
3. Local script blocks (by default) awaiting response
4. Remote unpacks and executes in parallel
5. Remote returns results to local interface
6. Local interface returns results to script
[Diagram: load a large dataset, then run a RevoScaleR or RevoScalePy algorithm] Supported remote contexts: SQL Server, Teradata (R only), Hadoop (R only), Hadoop MapReduce (R only), HDInsight (R only).
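The compute-context steps above can be mimicked with a toy Python sketch. Everything in it is hypothetical (the `ALGORITHMS` registry, `set_compute_context`, `run`, and `remote_execute` are illustrative names, not part of RevoScaleR or revoscalepy), and the "remote" side is just another function in the same process. The point is only the control flow: the caller's code stays the same, and the active context decides where the work runs.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry of algorithms that exist on both the local and remote side.
ALGORITHMS = {
    "row_count": len,
    "column_sum": sum,
}

_context = {"name": "local"}  # stand-in for the active compute context


def set_compute_context(name):
    """Switch between 'local' and 'remote' execution (toy analogue of rxSetComputeContext)."""
    if name not in ("local", "remote"):
        raise ValueError(f"unknown compute context: {name}")
    _context["name"] = name


def remote_execute(payload):
    """Remote side: unpack the request, run each chunk in parallel, pack the result."""
    request = pickle.loads(payload)
    func = ALGORITHMS[request["algorithm"]]
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(func, request["chunks"]))
    return pickle.dumps(sum(partials))


def run(algorithm, chunks):
    """Local interface: check the context, then run here or ship the request."""
    if _context["name"] == "local":
        return sum(ALGORITHMS[algorithm](chunk) for chunk in chunks)
    payload = pickle.dumps({"algorithm": algorithm, "chunks": chunks})
    response = remote_execute(payload)  # the caller blocks until the response arrives
    return pickle.loads(response)


chunks = [[1, 2, 3], [4, 5], [6]]
local_result = run("column_sum", chunks)

set_compute_context("remote")
remote_result = run("column_sum", chunks)

print(local_result == remote_result)
```

Shipping an algorithm name plus data, rather than arbitrary code, is a simplification chosen for this sketch; it keeps the request easy to serialize.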
Supports custom, multi-layer network topology with filtered, convolutional, and pooling bundles. Binary classification, multi-class classification, regression. Bing Ads click prediction ($50M per year revenue gain); image classification.
L1, L2 regularization. Binary classification, multi-class classification.
Easy-to-train learner for anomaly detection.
Boosted decision tree, similar to XGBoost. Supports up to ~100K features.
State-of-the-art tree ensembles (Random Forest). Supports up to ~100K features.
Speed, scalability, and support for L1, L2 regularization. Supports up to 1B features!
Anomaly detection. Binary classification, regression. Binary classification, regression. Binary classification, regression.
Classifying user feedback. Fraud detection. One of the most popular and best performing learners inside Microsoft. Churn prediction. Used by Outlook for email spam filtering.
Battle tested, large language support, performant (Bing, Office). Ease of use; 1 line of code to set. Ease of use.
Performs natural language processing of free text into numerical representation. Converts categories into numerical data. Selects a subset of features to speed up training time.
Support ticket classification, sentiment analysis. Ad click prediction. Sentiment analysis, ad click prediction.
Canonical deployment patterns
SQL Server 2017
Key Points: Classical pattern of pulling data out of the database to a separate modeling environment. Data scientists will already be familiar with this approach, so it's something to build on. SQL Server 2017
Remote Execution Context SQL Server 2016/17 Results RevoScaleR & RevoScalePy Parallel Algorithms Iterate/ Sequence Parallel Worker Tasks
Remote Execution Context Key Points: Fast and friendly for existing R/Python users. Limited to what can be done through a SQL compute context. Requires external script execute permissions. SQL Server 2016/17 RevoScaleR & RevoScalePy Parallel Algorithms Iterate/Sequence Parallel Worker Tasks Results
Run R and Python from SQL environments T-SQL Apps T-SQL Script SQL Server 2016/17 Run Python & R From within the Query Processor
T-SQL Apps Key Points: Works with all of R/Python functions for maximum flexibility Most natural for users with some SQL familiarity, but doable for all Run R and Python from SQL environments T-SQL Script SQL Server 2016/17 Run Python & R From within the Query Processor
BI & Reporting; Web apps T-SQL Script Enable smart non-R apps SQL Server 2016/17 T-SQL Stored Procedure
BI & Reporting; Web apps T-SQL Script Enable smart non-R apps Key Points: Helper functions are available so you don't have to manually recode an R/Python script. Allows firing via traditional triggers. SQL Server 2016/17 T-SQL Stored Procedure
Production Apps T-SQL SQL Server 2017 Events Models Stored Procs and Triggers Real time scoring engine
Production Apps T-SQL Key Points: R/Python need not be installed. Works with many of the RevoScale* and MicrosoftML models. Works on SQL 2017 and Azure SQL DB. Single-millisecond response times. SQL Server 2017 Events Models Stored Procs and Triggers Real time scoring engine
Other Hints and Deployment Considerations
Integration with SQL query execution Parallel query pushing data to multiple external processes / threads Use in-memory technology and Columnstore Indexes alongside your ML scripts Streaming mode execution Stream data in batches to the R/Python process to scale beyond available memory Train and Predict using parallelism Leverage RevoScaleR/revoscalepy and scale your R and Python scripts using multi-threading and parallel processing Native scoring for faster real-time predictions (New in 2017)
No dependency between rows (ex: scoring). Trivial parallelism.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
# Unserialize the model
logitObj <- unserialize(modelbin);
# Score the input rows with the classification model (tipped or not) and report elapsed time
system.time(OutputDataSet <- data.frame(predict(logitObj, newdata = InputDataSet, type = "response")))[3];',
    @input_data_1 = N'
SELECT tipped, passenger_count, trip_time_in_secs, trip_distance, d.direct_distance
FROM dbo.nyctaxi_sample TABLESAMPLE (50 PERCENT) REPEATABLE (98074)
CROSS APPLY [CalculateDistance](pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS d
OPTION (MAXDOP 2) -- Needed only to control DOP',
    @parallel = 1,
    @params = N'@modelbin varbinary(max), @r_rowsPerRead int',
    @modelbin = @model,
    @r_rowsPerRead = 5000;
[Diagram: sp_execute_external_script running Predict with @parallel = 1 (MAXDOP = 2)]
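Because no row depends on any other, scoring can be split across workers and the results concatenated in order. The following is a minimal Python sketch of that idea, not what SQL Server does internally: the coefficients are made up for illustration, `parallel_score` is a hypothetical helper, and `degree` only loosely mirrors MAXDOP.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical logistic-regression coefficients; a real model would be unserialized instead.
INTERCEPT = -1.5
WEIGHTS = {"passenger_count": 0.1, "trip_distance": 0.4}


def score_row(row):
    """Return the predicted probability for one row (a dict of feature values)."""
    z = INTERCEPT + sum(WEIGHTS[name] * row[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def score_partition(rows):
    """Score one partition; it needs nothing from any other partition."""
    return [score_row(row) for row in rows]


def parallel_score(rows, degree=2):
    """Split rows into `degree` partitions, score them concurrently, keep row order."""
    size = max(1, math.ceil(len(rows) / degree))
    partitions = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=degree) as pool:
        results = pool.map(score_partition, partitions)  # map preserves partition order
    return [p for part in results for p in part]


rows = [{"passenger_count": i % 4 + 1, "trip_distance": i * 0.5} for i in range(10)]
serial = score_partition(rows)
parallel = parallel_score(rows, degree=2)
print(serial == parallel)
```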
Requirements: No dependency between rows (ex: scoring).
Key Benefits: Execute script over chunks of data. Process data that doesn't fit in memory. Can be used from client (rx* function) or server.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
# Unserialize the model
logitObj <- unserialize(modelbin);
# Score the input rows with the classification model (tipped or not) and report elapsed time
system.time(OutputDataSet <- data.frame(predict(logitObj, newdata = InputDataSet, type = "response")))[3];',
    @input_data_1 = N'
SELECT tipped, passenger_count, trip_time_in_secs, trip_distance, d.direct_distance
FROM dbo.nyctaxi_sample TABLESAMPLE (50 PERCENT) REPEATABLE (98074)
CROSS APPLY [CalculateDistance](pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS d',
    @params = N'@modelbin varbinary(max), @r_rowsPerRead int',
    @modelbin = @model,
    @r_rowsPerRead = 5000;
[Diagram: a 15,000-row dataset passed to sp_execute_external_script with @r_rowsPerRead = 5000 is processed as three chunks of 5,000 rows]
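The chunked execution in the diagram can be sketched in plain Python. This is an illustration only: `read_in_batches` and `stream_score` are hypothetical helpers, and the scoring function is a placeholder. Only one batch is held in memory at a time, which is what allows the input to exceed available memory.

```python
from itertools import islice


def read_in_batches(rows, rows_per_read):
    """Yield lists of at most `rows_per_read` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, rows_per_read))
        if not batch:
            return
        yield batch


def stream_score(rows, score, rows_per_read=5000):
    """Score one batch at a time and yield that batch's predictions."""
    for batch in read_in_batches(rows, rows_per_read):
        yield [score(row) for row in batch]


# 15,000 rows read 5,000 at a time; the lambda is a placeholder for a real model.
batch_sizes = [len(preds) for preds in stream_score(range(15_000), score=lambda x: x % 2)]
print(batch_sizes)  # [5000, 5000, 5000]
```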
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
# Define the connection string
connStr <- paste("Driver=SQL Server;Server=", instance_name, ";Database=", database_name, ";Trusted_Connection=true;", sep = "");
# Set the compute context
cc <- RxInSqlServer(connectionString = connStr, numTasks = 4);
# Pull data from query
featureDataSource <- RxSqlServerData(sqlQuery = input_query, connectionString = connStr, computeContext = cc);
# Table to write data to, using compute context
tipPredictions <- RxSqlServerData(table = "nyc_taxi_tip_predictions", connectionString = connStr);
# Unserialize the model
logitObj <- unserialize(modelbin);
# Predict tipped or not based on the model
predictions <- rxPredict(logitObj, data = featureDataSource, outData = tipPredictions, overwrite = TRUE);',
    @params = N'@input_query nvarchar(max)',
    @input_query = N'SELECT * FROM nyctaxi_training_sample';
[Diagram: sp_execute_external_script running rxLogit over @input_data_1 (MAXDOP = 2); each rx call runs in a BxlServer process, and the partial results m1 + m2 are combined into the model object]
DECLARE @model varbinary(max) = (SELECT native_model FROM models WHERE model_name = 'Fraud Detection Model');
PREDICT(MODEL = @model, DATA = new_transaction)
-- Check/set external resource pool config
SELECT * FROM sys.resource_governor_resource_pools WHERE name = 'default';
SELECT * FROM sys.resource_governor_external_resource_pools WHERE name = 'default';
ALTER RESOURCE POOL "default" WITH (max_memory_percent = 60);
ALTER EXTERNAL RESOURCE POOL "default" WITH (max_memory_percent = 80);
ALTER RESOURCE GOVERNOR RECONFIGURE; -- enforce changes
DMV: Description
sys.dm_exec_requests: New column: external_script_request_id
sys.dm_external_script_requests: Returns running external scripts, DOP & assigned user account
sys.dm_external_script_execution_stats: Number of executions for rx* functions in RevoScaleR package
sys.dm_os_performance_counters: New External Scripts performance counters
https://docs.microsoft.com/en-us/sql/advanced-analytics/r/new-components-in-sql-server-to-support-r
Reduced surface area and isolation: the 'external scripts enabled' setting is required. R/Python scripts execute outside of the SQL Server process space.
Script execution requires explicit permission: sp_execute_external_script requires EXECUTE ANY EXTERNAL SCRIPT for non-admins. A SQL Server login/user is required, along with db/table access.
R/Python processes have limited privileges: R/Python processes run under local user accounts in the SQLRUserGroup. Each execution is isolated. Different users run under different accounts. Windows firewall rules block outbound traffic.
Examples using R:
# sqlServer is assumed to be a SQL Server compute context defined elsewhere
sqlPackages <- rxInstalledPackages(fields = c("Package", "Version", "Built"), computeContext = sqlServer)
pkgs <- c("ggplot2")
rxInstallPackages(pkgs = pkgs, verbose = TRUE, scope = "private", computeContext = sqlServer)
rxRemovePackages(pkgs = pkgs, verbose = TRUE, scope = "private", computeContext = sqlServer)

Example using T-SQL:
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
myPackages <- rxInstalledPackages();
OutputDataSet <- as.data.frame(myPackages);
';
Azure SQL Database R support Python support Machine Learning Services in SQL Server on Linux Additional algorithms and pre-trained models Native Scoring for more models
R Services ML Services AKA.MS/MLSQLDEV SSMS Reports for ML Services ML cheat sheet Hospital length of Stay demo scripts SQL Server Machine Learning Services
Thank you very much! mhelb@microsoft.com
Learning and Scoring Process Learning Labels Images Featurization (using pre-trained ResNet18 neural network model) Features Classification Algorithm (Boosted Tree) Classifier Model Scoring Images Featurization (using pre-trained ResNet18 neural network model) Features Classification Predictions
Distributed Featurization and Training on HDInsight. SQL Server Models Table HDInsight-MRS Azure Blob Storage CT Scan Images Classifier Training Featurization Edge Distributed Featurization
Scoring with Deep Learning Model in SQL SQL Server Web App Stored Procedures with R Code Featurization Scoring with the classifier model Stored Procedure call Model table, Features table, New Images table Diagnosis: 35% certainty
Image Featurization
Parallel Featurization (30x speedup)
Training on Spark and storing in SQL
Scoring in SQL