Data and AI LATAM 2018


Streamline Productivity and Simplify Deployment

Separate service or embedded logic

[Diagram: with a separate service, applications send scoring requests to an analytics server that holds the data transformations, model training, and model. With embedded logic, the data transformations, model training, scoring, and model all live in SQL Server.]

[Charts: IEEE Spectrum Top Programming Languages (IEEE Spectrum, July 2017); KDnuggets Top Data Science Tools, 2017]

SQL Server 2017

http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster

- Eliminate data movement
- Operationalize scripts and models
- Enterprise-grade performance and scale
- Extensibility

Demo Hello World

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'print(paste("Hello World from:", Revo.version$version.string));';

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import sys
print("Hello World from:", sys.version)
';

Using SQL Server 2017 Machine Learning Services

- Augments R and Python with parallelized, distributed algorithms
- Provides in-database execution of scripts and algorithms
- Parallel algorithms overcome Python and R memory limitations
- Reduces security risks by keeping data in-database
- Creates and consumes portable models

- One call, one answer
- Arbitrarily large data sets
- Arbitrarily large worker task set
- Mathematically the same as single-threaded
- Platform independent
- Most are written in C++ for speed

1. Algorithm begins initiator process
2. Initiator distributes work to nodes
3. Finalizer collects results
4. Finalizer iterates or continues
5. Finalizer evaluates final model
6. Returns single model to calling script

[Diagram: RevoScaleR and revoscalepy algorithms and functions. Load a large dataset (data larger than RAM), then run a RevoScaleR or revoscalepy algorithm.]
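The initiator, worker, and finalizer roles listed above can be illustrated with a small Python sketch. It is not RevoScaleR or revoscalepy code: the function names are hypothetical, threads stand in for worker nodes, and the data sits in an in-memory list rather than being read in chunks from disk. The point it shows is that combining per-chunk sufficient statistics gives the same least-squares fit as a single pass over all rows.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Worker task: sufficient statistics for one chunk of (x, y) rows."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return n, sx, sy, sxx, sxy


def finalize(parts):
    """Finalizer: add up the partial statistics and solve for slope and intercept."""
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*parts))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def fit_chunked(rows, chunk_size, workers=4):
    """Initiator: split the rows into chunks, distribute them, pass results to the finalizer."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_stats, chunks))
    return finalize(parts)


# Synthetic data on the line y = 3x + 2 (an assumption for the demo).
rows = [(x, 3 * x + 2) for x in range(1, 1001)]
chunked_fit = fit_chunked(rows, chunk_size=100)
single_pass_fit = finalize([partial_stats(rows)])
```

Because each chunk contributes only five numbers, the amount of state the finalizer holds does not grow with the number of rows, which is what lets this style of algorithm handle data larger than RAM.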

- One call
- One or many models returned
- Arbitrarily large data sets
- Arbitrarily large worker task set
- Augments RevoScaleR
- Fast learners
- Deep learning algorithms
- Ensemble results using rxEnsemble

[Diagram: RevoScaleR and revoscalepy algorithms and functions. Load a large dataset (data larger than RAM), then run a RevoScaleR or revoscalepy algorithm.]

- Executes RevoScaleR algorithms on remote data and CPUs
- rxSetComputeContext redirects to remote
- Algorithms in the RevoScaleR library redirect as set
- Results are returned to the script as though local

1. Algorithm on local checks compute context
2. If set remote, packages and ships request
3. Local script blocks (by default) awaiting response
4. Remote unpacks and executes in parallel
5. Remote returns results to local interface
6. Local interface returns results to script

[Diagram: load a large dataset, then run a RevoScaleR or revoscalepy algorithm. Compute contexts: SQL Server, Teradata*, Hadoop*, Hadoop MapReduce*, HDInsight* (* R only).]

Learners:
- Supports custom, multi-layer network topology with filtered, convolutional, and pooling bundles. Tasks: binary classification, multi-class classification, regression. Used for: Bing Ads click prediction ($50M per year revenue gain); image classification.
- L1, L2 regularization. Tasks: binary classification, multi-class classification. Used for: classifying user feedback.
- Easy-to-train learner for anomaly detection. Tasks: anomaly detection. Used for: fraud detection.
- Boosted decision tree, similar to XGBoost; supports up to ~100K features. Tasks: binary classification, regression. One of the most popular and best performing learners inside Microsoft.
- State-of-the-art tree ensembles (Random Forest); supports up to ~100K features. Tasks: binary classification, regression. Used for: churn prediction.
- Speed, scalability, and support for L1, L2 regularization; supports up to 1B features. Tasks: binary classification, regression. Used for: email spam filtering in Outlook.

Transforms:
- Performs natural language processing of free text into a numerical representation. Battle tested, large language support, performant (Bing, Office). Used for: support ticket classification, sentiment analysis.
- Converts categories into numerical data. Ease of use; 1 line of code to set. Used for: ad click prediction.
- Selects a subset of features to speed up training time. Ease of use. Used for: sentiment analysis, ad click prediction.

Canonical deployment patterns

SQL Server 2017

Key Points:
- Classical pattern of pulling data out of the database to a separate modeling environment
- Data scientists will already be familiar with this approach, so it's something to build on

Remote Execution Context

[Diagram: a remote execution context sends RevoScaleR and revoscalepy calls to a SQL compute context on SQL Server 2016/17, where parallel algorithms iterate/sequence parallel worker tasks and return results.]

Key Points:
- Fast and friendly for existing R/Python users
- Limited to what can be done through [...]
- Requires external script execute permissions

Run R and Python from SQL environments

[Diagram: T-SQL apps send a T-SQL script to SQL Server 2016/17, which runs Python and R from within the query processor.]

Key Points:
- Works with all R/Python functions for maximum flexibility
- Most natural for users with some SQL familiarity, but doable for all

Enable smart non-R apps

[Diagram: BI and reporting tools and web apps send a T-SQL script that calls a T-SQL stored procedure on SQL Server 2016/17.]

Key Points:
- Helper functions are available so you don't have to manually recode an R/Python script
- Allows firing via traditional triggers

Real-time scoring engine

[Diagram: production apps send T-SQL and events to SQL Server 2017, where stored procedures and triggers score against stored models.]

Key Points:
- R/Python need not be installed
- Works with many of the RevoScale* and MicrosoftML models
- Works on SQL Server 2017 and Azure SQL DB
- Single-millisecond response times

Other Hints and Deployment Considerations

Integration with SQL query execution
- Parallel query pushing data to multiple external processes/threads
- Use in-memory technology and columnstore indexes alongside your ML scripts

Streaming mode execution
- Stream data in batches to the R/Python process to scale beyond available memory

Train and predict using parallelism
- Leverage RevoScaleR/revoscalepy and scale your R and Python scripts using multi-threading and parallel processing

Native scoring for faster real-time predictions (new in 2017)

Trivial Parallelism

Requirement: no dependency between rows (for example, scoring).

-- @model is a varbinary(max) variable holding the serialized model, declared earlier in the batch
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Unserialize the model
        logitObj <- unserialize(modelbin);
        # Score the input rows (predict tipped or not) and time the call
        system.time(OutputDataSet <- data.frame(predict(logitObj, newdata = InputDataSet, type = "response")))[3];
    ',
    @input_data_1 = N'
        SELECT tipped, passenger_count, trip_time_in_secs, trip_distance, d.direct_distance
        FROM dbo.nyctaxi_sample TABLESAMPLE (50 PERCENT) REPEATABLE (98074)
        CROSS APPLY [CalculateDistance](pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS d
        OPTION (MAXDOP 2) -- Needed only to control DOP
    ',
    @parallel = 1,
    @params = N'@modelbin varbinary(max), @r_rowsPerRead int',
    @modelbin = @model,
    @r_rowsPerRead = 5000;

[Diagram: sp_execute_external_script running the Predict script with @parallel = 1 (MAXDOP = 2).]
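Why row independence makes parallel scoring trivial can be shown with a short Python sketch. Everything in it is an assumption made for illustration: the weights are a made-up stand-in for a trained model, `dop` only mimics the role of MAXDOP, and threads stand in for the external script processes SQL Server would launch.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a trained logistic model (not taken from the slides).
WEIGHTS = {"passenger_count": 0.1, "trip_distance": 0.3}
BIAS = -1.0


def score(row):
    """Probability for one row; it depends on that row only."""
    z = BIAS + sum(w * row[name] for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def score_partition(rows):
    """Score one partition sequentially."""
    return [score(r) for r in rows]


def score_parallel(rows, dop=2):
    """Split rows into `dop` contiguous partitions and score them concurrently."""
    size = max(1, math.ceil(len(rows) / dop))
    parts = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=dop) as pool:
        # map() returns results in partition order, so row order is preserved.
        results = list(pool.map(score_partition, parts))
    return [s for part in results for s in part]


rows = [{"passenger_count": i % 4 + 1, "trip_distance": i * 0.5} for i in range(10)]
serial_scores = score_partition(rows)
parallel_scores = score_parallel(rows, dop=2)
```

Since no row needs information from any other row, the partitions can be scored in any order and simply concatenated, and the result matches the serial run.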

Requirements:
- No dependency between rows (for example, scoring)

Key Benefits:
- Execute script over chunks of data
- Process data that doesn't fit in memory
- Can be used from client (rx* function) or server

-- @model is a varbinary(max) variable holding the serialized model, declared earlier in the batch
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Unserialize the model
        logitObj <- unserialize(modelbin);
        # Score the input rows (predict tipped or not) and time the call
        system.time(OutputDataSet <- data.frame(predict(logitObj, newdata = InputDataSet, type = "response")))[3];
    ',
    @input_data_1 = N'
        SELECT tipped, passenger_count, trip_time_in_secs, trip_distance, d.direct_distance
        FROM dbo.nyctaxi_sample TABLESAMPLE (50 PERCENT) REPEATABLE (98074)
        CROSS APPLY [CalculateDistance](pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS d
    ',
    @params = N'@modelbin varbinary(max), @r_rowsPerRead int',
    @modelbin = @model,
    @r_rowsPerRead = 5000;

[Diagram: a 15,000-row dataset passed to sp_execute_external_script with @r_rowsPerRead = 5000 is processed as three chunks of 5,000 rows.]
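The chunking behaviour in the diagram (15,000 rows read 5,000 at a time) can be mimicked in plain Python. This is only a sketch of the idea, not how SQL Server implements streaming: `read_chunks`, `predict`, and the linear coefficients are hypothetical, and a generator plays the role of the input query so that only one chunk is materialized at a time.

```python
from itertools import islice


def read_chunks(rows, rows_per_read):
    """Yield successive lists of at most rows_per_read rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, rows_per_read))
        if not chunk:
            return
        yield chunk


def predict(trip_distance):
    """Hypothetical stand-in model: a fixed linear function of one feature."""
    return 2.5 + 0.8 * trip_distance


def score_streaming(rows, rows_per_read):
    """Score rows chunk by chunk, keeping only running aggregates in memory."""
    chunk_sizes = []
    rows_scored = 0
    prediction_sum = 0.0
    for chunk in read_chunks(rows, rows_per_read):
        predictions = [predict(x) for x in chunk]
        chunk_sizes.append(len(chunk))
        rows_scored += len(predictions)
        prediction_sum += sum(predictions)
    return chunk_sizes, rows_scored, prediction_sum


# A generator of 15,000 synthetic trip distances; rows are produced lazily.
trip_distances = (i * 0.1 for i in range(15000))
chunk_sizes, rows_scored, prediction_sum = score_streaming(trip_distances, rows_per_read=5000)
```

Peak memory here is bounded by the chunk size rather than the dataset size, which is the benefit the slide attributes to streaming execution.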

-- instance_name, database_name, and modelbin are used by the script and must also be
-- declared in @params and passed in; the slide does not show them.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Define the connection string
        connStr <- paste("Driver=SQL Server;Server=", instance_name, ";Database=", database_name, ";Trusted_Connection=true;", sep = "");
        # Set the compute context
        cc <- RxInSqlServer(connectionString = connStr, numTasks = 4);
        # Pull data from the query
        featureDataSource <- RxSqlServerData(sqlQuery = input_query, connectionString = connStr, computeContext = cc);
        # Table to write data to, using the compute context
        tipPredictions <- RxSqlServerData(table = "nyc_taxi_tip_predictions", connectionString = connStr);
        # Unserialize the model
        logitObj <- unserialize(modelbin);
        # Predict tipped or not based on the model
        predictions <- rxPredict(logitObj, data = featureDataSource, outData = tipPredictions, overwrite = TRUE);
    ',
    @params = N'@input_query nvarchar(max)',
    @input_query = N'SELECT * FROM nyctaxi_training_sample';

[Diagram: sp_execute_external_script runs an rxLogit script with @input_data_1 = N'SELECT ...' (MAXDOP = 2). Each rx call is handled by BxlServer processes, and their partial results (m1 + m2) are combined into a single model object.]

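DECLARE @model varbinary(max) = (
    SELECT native_model FROM models WHERE model_name = 'Fraud Detection Model');

-- The WITH clause lists the output columns, which depend on the model type;
-- "Score float" is an assumption used here for illustration.
SELECT d.*, p.*
FROM PREDICT(MODEL = @model, DATA = new_transaction AS d)
WITH (Score float) AS p;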
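The pattern behind native scoring (a serialized model kept in a table, looked up by name, then applied to new rows) can be tried out without SQL Server. The sketch below is an analogy only: SQLite and JSON stand in for SQL Server and its native model format, and the table layout, coefficients, and column names are all assumptions.

```python
import json
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (model_name TEXT PRIMARY KEY, native_model BLOB)")

# Hypothetical model: logistic-regression coefficients serialized to bytes.
model = {"intercept": -2.0, "coef": {"amount": 0.004, "is_foreign": 1.5}}
conn.execute(
    "INSERT INTO models (model_name, native_model) VALUES (?, ?)",
    ("Fraud Detection Model", json.dumps(model).encode("utf-8")),
)


def load_model(name):
    """Fetch the serialized model by name and deserialize it."""
    row = conn.execute(
        "SELECT native_model FROM models WHERE model_name = ?", (name,)
    ).fetchone()
    return json.loads(row[0].decode("utf-8"))


def predict(loaded_model, row):
    """Apply the stored coefficients to one input row."""
    z = loaded_model["intercept"] + sum(
        c * row[k] for k, c in loaded_model["coef"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


loaded = load_model("Fraud Detection Model")
new_transaction = [
    {"amount": 20.0, "is_foreign": 0},
    {"amount": 900.0, "is_foreign": 1},
]
scores = [predict(loaded, t) for t in new_transaction]
```

Scoring needs only the stored bytes and a small prediction routine, not the training environment, which is consistent with the deck's point that R/Python need not be installed for native scoring.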

-- Check/set External Resource Pool config
SELECT * FROM sys.resource_governor_resource_pools WHERE name = 'default';
SELECT * FROM sys.resource_governor_external_resource_pools WHERE name = 'default';

ALTER RESOURCE POOL "default" WITH (max_memory_percent = 60);
ALTER EXTERNAL RESOURCE POOL "default" WITH (max_memory_percent = 80);
ALTER RESOURCE GOVERNOR RECONFIGURE; -- enforce changes

DMVs and their descriptions:
- sys.dm_exec_requests: new column external_script_request_id
- sys.dm_external_script_requests: returns running external scripts, DOP, and assigned user account
- sys.dm_external_script_execution_stats: number of executions for rx* functions in the RevoScaleR package
- sys.dm_os_performance_counters: new External Scripts performance counters


https://docs.microsoft.com/en-us/sql/advanced-analytics/r/new-components-in-sql-server-to-support-r

Reduced surface area and isolation
- The "external scripts enabled" configuration option is required
- R/Python scripts execute outside of the SQL Server process space

Script execution requires explicit permission
- sp_execute_external_script requires EXECUTE ANY EXTERNAL SCRIPT for non-admins
- A SQL Server login/user is required, along with database/table access

R/Python processes have limited privileges
- R/Python processes run under local user accounts in the SQLRUserGroup
- Each execution is isolated: different users run under different accounts
- Windows firewall rules block outbound traffic

Examples using R:

# sqlServer is assumed to be a compute context object created earlier
sqlPackages <- rxInstalledPackages(fields = c("Package", "Version", "Built"), computeContext = sqlServer)

pkgs <- c("ggplot2")
rxInstallPackages(pkgs = pkgs, verbose = TRUE, scope = "private", computeContext = sqlServer)

Example using T-SQL:

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        myPackages <- rxInstalledPackages();
        OutputDataSet <- as.data.frame(myPackages);
    ';

Removing a package, using R:

pkgs <- c("ggplot2")
rxRemovePackages(pkgs = pkgs, verbose = TRUE, scope = "private", computeContext = sqlServer)

- Azure SQL Database: R support, Python support
- Machine Learning Services in SQL Server on Linux
- Additional algorithms and pre-trained models
- Native scoring for more models

- R Services
- ML Services
- AKA.MS/MLSQLDEV
- SSMS Reports for ML Services
- ML cheat sheet
- Hospital Length of Stay demo scripts
- SQL Server Machine Learning Services

Thank you very much! mhelb@microsoft.com

Learning and Scoring Process

Learning: Labels and Images -> Featurization (using pre-trained ResNet18 neural network model) -> Features -> Classification Algorithm (Boosted Tree) -> Classifier Model

Scoring: Images -> Featurization (using pre-trained ResNet18 neural network model) -> Features -> Classification -> Predictions

Distributed Featurization and Training on HDInsight

[Diagram: CT scan images in Azure Blob Storage; distributed featurization and classifier training on HDInsight-MRS (edge node); models table in SQL Server.]

Scoring with Deep Learning Model in SQL

[Diagram: a web app makes a stored procedure call to SQL Server. Stored procedures with R code perform featurization and scoring with the classifier model, using the model table, features table, and new images table. Example output: "Diagnosis: 35% certainty".]

Image Featurization

Parallel Featurization (30x speedup)

Training on Spark and storing in SQL

Scoring in SQL