SQL Server Machine Learning Marek Chmel & Vladimir Muzny @VladimirMuzny & @MarekChmel MCTs, MVPs, MCSEs Data Enthusiasts! vladimir@datascienceteam.cz marek@datascienceteam.cz
Session Agenda Machine learning and Data Science SQL 2017 Machine learning architecture Using R for Machine Learning with SQL Server Using Python for Machine Learning with SQL Server
Machine Learning Introduction Predict properties of new data by learning from a sample Predict sales of stores in a region based on historical sales Predict probability of fraud on a new credit card transaction Predict default of a new loan based on loan / transaction history Predict sentiment of a new tweet or review Classify new image(s) based on sample images & attributes Classify data into groups or clusters Popular ML technologies R & Python
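To make "predict properties of new data by learning from a sample" concrete, here is a minimal sketch in plain Python. It fits an ordinary least-squares line to a handful of monthly sales figures and forecasts the next month. The numbers and the names fit_line and predict are invented for illustration; a real project would more likely use R or a Python library such as scikit-learn.

```python
# Minimal sketch: learn a linear trend from a sample, then predict a new value.
# The sales figures are made up purely for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


months = [1, 2, 3, 4, 5]                      # the historical sample
sales = [100.0, 110.0, 120.0, 130.0, 140.0]   # hypothetical store sales
model = fit_line(months, sales)               # "learning from a sample"
forecast = predict(model, 6)                  # "predict properties of new data"
print(forecast)
```

The same two-step shape (train on history, score new rows) applies to the fraud, loan default and sentiment examples on this slide; only the model and the features change.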
Advanced analytics, or data science or artificial intelligence?
Machine learning / data mining algorithms
Data science more than data engineering
Main Differences DS vs. BI
Data Science & Machine Learning Roles Data Scientist A highly educated and skilled person who can solve complex data problems by employing deep expertise in scientific disciplines (mathematics, statistics or computer science) Data Professional A skilled person who creates or maintains data systems, data solutions or implements predictive modelling. Roles: Database Administrator, Database Developer, or BI Developer Software Developer A skilled person who designs and develops programming logic, and can apply machine learning to integrate predictive functionality into applications
Machine Learning Challenges
Real World Applications
Microsoft Rs Microsoft R Open Microsoft R Open in Azure ML Microsoft R Client Microsoft R Server...for HDInsight, for Hadoop, for Linux (SUSE, Red Hat/CentOS) Microsoft SQL Server 2016 R Services on-prem and for Azure SQL Database (preview) Microsoft SQL Server 2017 Machine Learning Services Microsoft Machine Learning Server
Python Fewer statistics/ML packages than R and nowhere near as vast in scope, but becoming just enough Great as glue: orchestration and scripting Key data science libraries numpy & scipy (numeric processing and stats) pandas (data frames) matplotlib and ggplot (charts) scikit-learn (mining) microsoftml* and revoscalepy*
Machine Learning Services History 2015 Microsoft acquires Revolution Analytics 2016 SQL Server R Services 2017 SQL Server Machine Learning Services
Machine Learning Architecture Extensibility framework create a better interface between SQL Server and data science languages such as R and Python reduce the friction that occurs when data science solutions are moved into production protect data that might be exposed during the data science development process Executing a trusted scripting language within a secure framework database developer can maintain security while allowing data scientists to use enterprise data SQL Server 2016 Extensibility Framework R Support (3.2.2) Microsoft R Server SQL Server 2017 Python Support (3.5.2) R Support (3.3.3) Native Scoring using PREDICT In-database Package Management
Architecture core concepts Multi-process architecture Full interoperability with open source R and Python R and Python can function independently on SQL Server Microsoft provides a set of proprietary libraries that provide integration with SQL Server Security support for both integrated Windows authentication and password-based SQL logins SQL Server Trusted Launchpad to manage external script execution Scalability and performance resource governance and parallel processing using SQL Server distributed computing provided by the algorithms in RevoScaleR and revoscalepy.
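External scripts reach the Launchpad through the sp_execute_external_script stored procedure. The sketch below only composes the T-SQL text for such a call so that its shape is visible; it does not connect to a server. The helper build_external_script_call, the table dbo.Sales and its columns are hypothetical. InputDataSet and OutputDataSet are the default variable names SQL Server uses for the script's input and output data frames, and the instance must have the 'external scripts enabled' option turned on.

```python
# Sketch: compose the T-SQL for one sp_execute_external_script call.
# build_external_script_call, dbo.Sales and its columns are hypothetical.

def build_external_script_call(script, input_query, result_columns,
                               language="Python"):
    """Return T-SQL text that runs `script` over the rows of `input_query`."""

    def quote(text):
        # A single quote inside a T-SQL literal is escaped by doubling it.
        return "N'" + text.replace("'", "''") + "'"

    columns = ", ".join(f"{name} {sql_type}"
                        for name, sql_type in result_columns)
    return (
        "EXEC sp_execute_external_script\n"
        f"    @language = {quote(language)},\n"
        f"    @script = {quote(script)},\n"
        f"    @input_data_1 = {quote(input_query)}\n"
        f"WITH RESULT SETS (({columns}));"
    )


# The script body runs in the external Python process, not in SQL Server.
py_script = "OutputDataSet = InputDataSet"
tsql = build_external_script_call(
    py_script,
    "SELECT StoreId, Amount FROM dbo.Sales",
    [("StoreId", "int"), ("Amount", "float")],
)
print(tsql)
```

Passing "R" as the language argument yields the equivalent call for an R script, which is how one stored procedure serves both languages.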
R Language Architecture RevoScaleR: includes a variety of APIs for data manipulation and analysis. The APIs have been optimized to analyze data sets that are too big to fit in memory and to perform computations distributed over several cores or processors. RevoPemaR: a framework for developing your own Parallel External Memory Algorithms (PEMAs).
Python and SQL Server revoscalepy is a new library provided by Microsoft to support distributed computing, remote compute contexts, and high-performance algorithms for Python. It is based on the RevoScaleR package for R, which was provided in Microsoft R Server and SQL Server R Services, and aims to provide the same functionality: Supports multiple compute contexts, both remote and local Provides functions equivalent to those in RevoScaleR for data transformation and visualization Provides Python versions of RevoScaleR machine learning algorithms for distributed or parallel processing Improved performance, including use of the Intel math libraries
Best Practices: Resources Memory is a key constraint for R / Python scripts Use sys.dm_resource_governor_external_resource_pools DMV with a test workload Leverage Resource Governance to isolate SQL & external scripts New EXTERNAL RESOURCE POOL object Leverage Always On Secondaries to offload external script execution
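As a rough illustration of the resource governance advice, the sketch below generates the T-SQL statements that create an external resource pool, bind a workload group to it, apply the change, and then inspect the DMV named on the slide. The helper external_pool_statements, the names ep_ml and wg_ml, and the percentage limits are assumptions chosen for the example. A classifier function that routes sessions into the workload group is also needed in practice and is left out here.

```python
# Sketch: T-SQL for isolating external scripts with Resource Governor.
# Pool and group names and the limits are hypothetical; tune them against
# a test workload before using anything like this in production.

def external_pool_statements(pool, group, max_cpu_percent, max_memory_percent):
    """Return the T-SQL statements, in order, as a list of strings."""
    for value in (max_cpu_percent, max_memory_percent):
        if not 1 <= value <= 100:
            raise ValueError("percent limits must be between 1 and 100")
    return [
        f"CREATE EXTERNAL RESOURCE POOL {pool} "
        f"WITH (MAX_CPU_PERCENT = {max_cpu_percent}, "
        f"MAX_MEMORY_PERCENT = {max_memory_percent});",
        # SQL work stays in the default pool; external scripts use the new one.
        f'CREATE WORKLOAD GROUP {group} USING "default", EXTERNAL {pool};',
        "ALTER RESOURCE GOVERNOR RECONFIGURE;",
        # Check the resulting configuration of the external pools.
        "SELECT * FROM sys.dm_resource_governor_external_resource_pools;",
    ]


statements = external_pool_statements("ep_ml", "wg_ml", 50, 25)
for statement in statements:
    print(statement)
```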
Best Practices: Operationalization Secure out-of-the-box defaults Some lift-and-shift scripts may not work. Ex: installing packages or reaching out to external resources Leverage SQL Server data integration capabilities Ex: DQ to pull data from other sources, SSIS, external tables Leverage SQL query processing integration Batch mode execution on Columnstore data Parallel execution for training (rx* functions) and scoring Streaming execution of external scripts
Python in SQL Server 2017 Anaconda distribution Distribution of Python focused on Data Science Package and environment manager Installs with more than 100 packages Python version 3.5.2 Jupyter notebooks
R in SQL Server 2017 Best-in-class scientific language Numerous packages available R 3.3.3 RStudio and external connectivity
Popular data science packages NumPy N-dimensional arrays, random numbers Pandas data manipulation, DataFrame object SciPy scientific computing and statistical methods Scikit-learn machine learning Matplotlib plotting and graphics
DEMO Machine Learning