Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

Similar documents
Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Python With Data Science

SQL Server Machine Learning Marek Chmel & Vladimir Muzny

Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS

Data Science. Data Analyst. Data Scientist. Data Architect

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Oracle Machine Learning Notebook

This document (including, without limitation, any product roadmap or statement of direction data) illustrates the planned testing, release and

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Data Science Bootcamp Curriculum. NYC Data Science Academy

Build a system health check for Db2 using IBM Machine Learning for z/os

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Oracle9i Data Mining. Data Sheet August 2002

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Introduction to Data Mining and Data Analytics

Machine Learning in the Process Industry. Anders Hedlund Analytics Specialist

Applying Supervised Learning

Boost your Analytics with ML for SQL Nerds

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

NVIDIA DEEP LEARNING INSTITUTE

Boost your Analytics with Machine Learning for SQL Nerds. Julie mssqlgirl.com

Deploying Machine Learning Models in Practice

Data Analyst Nanodegree Syllabus

Machine Learning using MapReduce

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

Data Driven Performance Repository to Classify and Retrieve Storage Tuning Profiles Ismael Solis Moreno IBM

Interpretable Machine Learning with Applications to Banking

CS 8520: Artificial Intelligence

Pre-Requisites: CS2510. NU Core Designations: AD

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

BIG DATA SCIENTIST Certification. Big Data Scientist

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course:

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

CSE 158. Web Mining and Recommender Systems. Midterm recap

OPERATIONALIZING MACHINE LEARNING USING GPU ACCELERATED, IN-DATABASE ANALYTICS

Deploying, Managing and Reusing R Models in an Enterprise Environment

MIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA

Unsupervised Learning

Automation.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Dr. SubraMANI Paramasivam. Think & Work like a Data Scientist with SQL 2016 & R

SIR C R REDDY COLLEGE OF ENGINEERING

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Why data science is the new frontier in software development

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Think & Work like a Data Scientist with SQL 2016 & R DR. SUBRAMANI PARAMASIVAM (MANI)

SOFTWARE DEVELOPMENT: DATA SCIENCE

Lecture 25: Review I

Populating the Galaxy Zoo

Data: a collection of numbers or facts that require further processing before they are meaningful

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Python Certification Training

COMS 4771 Clustering. Nakul Verma

Lecture on Modeling Tools for Clustering & Regression

MATLAB is a multi-paradigm numerical computing environment fourth-generation programming language. A proprietary programming language developed by

Rapid growth of massive datasets

Clustering & Dimensionality Reduction. 273A Intro Machine Learning

Machine Learning with Python

Classifying Depositional Environments in Satellite Images

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Facial Expression Classification with Random Filters Feature Extraction

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

Machine Learning Software ROOT/TMVA

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Artificial Intelligence. Programming Styles

9. Conclusions. 9.1 Definition KDD

Oracle Big Data Discovery

Beyond Training The next steps of Machine Learning. Chris /in/chrisparsonsdev

Contents. Preface to the Second Edition

Voice, Image, Video : AI in action with AWS. 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Getting Started with Advanced Analytics in Finance, Marketing, and Operations

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019

Data Analytics Training Program

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018

ABSTRACT I. INTRODUCTION. Dr. J P Patra 1, Ajay Singh Thakur 2, Amit Jain 2. Professor, Department of CSE SSIPMT, CSVTU, Raipur, Chhattisgarh, India

Data and AI LATAM 2018

Evaluation of Machine Learning Algorithms for Satellite Operations Support

MACHINE LEARNING Example: Google search

Oracle9i Data Mining. An Oracle White Paper December 2001

Spectral Classification

CS229 Final Project: Predicting Expected Response Times

Analytics and Visualization

SAS (Statistical Analysis Software/System)

Practical Machine Learning Agenda

Intro to Artificial Intelligence

DATA MINING TRANSACTION

Question Bank. 4) It is the source of information later delivered to data marts.

Data Analyst Nanodegree Syllabus

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

Performance Evaluation of Various Classification Algorithms

Data Science Course Content

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall

Data Mining: Models and Methods

COPYRIGHT DATASHEET

APPLICATION OF SOFTMAX REGRESSION AND ITS VALIDATION FOR SPECTRAL-BASED LAND COVER MAPPING

Jarek Szlichta

CS6220: DATA MINING TECHNIQUES

Transcription:

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect BEOP.CTO.TP4 Owner: OCTO Revision: 0001 Approved by: JAT Effective: 08/30/2018 Buchanan & Edwards Proprietary: Printed copies of this document are UNCONTROLLED. Verify that this is the correct version before use.

Using SQL Server 2017 Machine Learning Services Implementing Common Machine Learning Tasks Jon Tupitza Practice Director, Data Platform & Predictive Analytics 2

Take-Aways Understand Machine Learning and SQL Server ML Services Recognize the Categories of Machine Learning Algorithms along with Practical Applications Learn the Different Ways and Means to Implement SQL Server Machine Learning Services 3

Agenda Intro to Machine Learning SQL Server 2017 ML Services Value Proposition Getting Started Machine Learning Tasks Supervised vs. Unsupervised Classification Regression Clustering Principal Component Analysis 4

Artificial Intelligence: The Origins of Machine Learning Early efforts at AI involved Defining Explicit Rules for Processing Data Machine Learning involves Inferring Rules from the Structure of Data Deep Learning extends Machine Learning to Hyper-Parametric Data Learns from Successive Layers of Increasingly Meaningful Representations Mimics Neurons in the Human Brain Symbolic AI (1950 1980 s) Machine Learning (1990 s +) Deep Learning (2010+) Artificial Intelligence Machine Learning The advancement of Artificial Intelligence has been made possible only through the Data Explosion of the Internet and the increasing power of modern computers Deep Learning 5

Machine Learning: Automates Problem-Solving Using computational power to detect and exploit patterns in data Rather than trying to solve a problem, train a computer to solve the problem for you Fitting a model enables the system to predict outcomes by observing examples Well fit models generalize well. They produce reliable predictions when exposed to new data; not over-fitted! Spam & Fraud Detection Decision Engines: Approval or Disapproval Recommendation Engines: Which one to choose Predictive Maintenance: When will it break? Image Analysis: Classification & Object Detection Text Analysis: Term & Document Comprehension Natural Language Processing & Machine Translation 6

The Team Data Science Process Largely Heuristic! Based on Conducting Experiments (i.e., Scientific Method) Business Understanding Business Understanding Identify the Problem Domain Identify the Solution Scenario Deployment Data Acquisition & Understanding Acquire & Understand Data Develop Machine Learning Models Load, Prepare & Explore Data Identify Impactful Features Select & Engineer Features Train, Evaluate & Tune Models Modeling Deployment Publish Models as Webservices Consume Models Visually and Programmatically 7

Agenda Intro to Machine Learning SQL Server 2017 ML Services Value Proposition Getting Started Machine Learning Tasks Supervised vs. Unsupervised Classification Regression Clustering Principal Component Analysis 8

SQL Server ML Services: Value Proposition The First Commercial Database Server with Built-In Artificial Intelligence Enables Developers to Train, Evaluate and Deploy Machine Learning Models Inside of SQL Server Databases for Enterprise Production Overcomes Some Major Limitations Inherent to Statistical Software System Memory has been limited to the capacity of client workstations Data Movement has been saturating networks between remote storage and the development environment Performance and Scale have been limited by a lack of multi-threading and parallel processing capabilities Provides a Convenient Way to Operationalize Machine Learning Access ML Algorithms using familiar T-SQL stored procedures Manage Machine Learning Models in SQL Server database tables Store Predictive Outcomes in SQL Server database tables Leverage database mechanisms like security, governance and monitoring 9

SQL Server 2017 Machine Learning Services Adds Python Support: Grants access to deep learning tools like CNTK, TensorFlow and Keras. RevoScalePy in addition to RevoScaleR: Enables scaling ML to arbitrarily large datasets; beyond available memory. Linear & Logistic Regression, Decision Trees, Boosted Trees, Random Forests. API s for ETL processing, remote compute contexts and data sources. MicrosoftML: Simplifies complex AI scenarios involving Image and Text data sources Features Deep Neural Networks, SVM, Fast Tree, Forests, etc. Includes pre-trained models like ResNet for Image and Sentiment analysis Python Remote Compute in SQL Server: Enables data scientists to execute Python code remotely from their development machines to explore data and develop models without moving data around. 10

Where to Process: Transact-SQL versus R or Python Transact-SQL Table Joins and Filters Set Operations (Union, Except, Intersect) Sorts & Aggregations R or Python Statistical Calculations Predictive Algorithms Data Visualizations 11

Demonstration Getting Started with SQL Server 2017 Machine Learning Services 12

Agenda Intro to Machine Learning SQL Server 2017 ML Services Advantage and Value Getting Started Machine Learning Tasks Supervised vs. Unsupervised Classification Regression Clustering Principal Component Analysis 13

Machine Learning: Categories and Algorithms Supervised Learning: Classification and Regression Learn by historic example to predict the future Requires both Features and a Label (Observations where the answers are known) Train: Give the computer samples from which to learn Test: Give the computer different samples from which to assess accuracy Evaluate: Analyze test results to determine accuracy of the model Unsupervised Learning: Clustering (K-Means & Hierarchical) Understanding the past by gaining insight from historic data Involves Feature-sets having no Labels (Observations where answers aren t known) Examine data to reveal its intrinsic [but hidden] structure Make recommendations by grouping people, things, or events together 14

Supervised Learning: Classification Requires both Features and a Label (where the answers are known) Identify features most collinear to the label; eliminating those that aren t Produces discrete-value outcomes; i.e., categorical values Binomial (2-class) Classification: Boolean Decisions; e.g., Loan approvals Multinomial (Multi-class) Classification: Concatenate multiple binomial classifiers: Combinations and Permutations Learning from a finite set of values Determines into which category each observation should be classified: Based on performance criteria, which grade should be awarded Based on attributes, which type is it? Is it an Animal, Vegetable or Mineral? 15

Demonstrations Classification: Exploratory Data Analysis using Jupyter Notebooks with the RevoScalePy library s Remote Execution API s Deploy a Machine Learning Model to Predict IRIS Species using In-Database Scripting 16

Supervised Learning: Regression Requires both Features and a Label (where the answers are known) For determining if a correlation exists between 2 or more variables: Predictor (X); e.g., # Incidents Response (Y); e.g., # Officers Produces continuous-value outcomes; i.e., real numbers Linear Regression produces continuous values: e.g., can be any numeric value Logistic Regression produces discrete values: e.g., must be a digit between 0 9 17

Demonstration Linear Regression: Train and Evaluate a Model using Jupyter Notebooks with the RevoScalePy library s Remote Execution API s Operationalize a Machine Learning Model that Predicts Future Ski Rentals using In-Database Script Execution 18

Unsupervised Learning: K-Means Clustering Used where Feature-sets have No Labels (the answers aren t known) Appropriate for Separating Overlapping Spheroid Clusters This Algorithm Minimizes the Sum of Distances from Clusters Centers: 1) The hidden underlying structures in the data are discovered 2) Based on that structure, each of the observations (data points) are iteratively moved closer to conceptual center-points that represent the centers of k groups. 3) The centers are then moved to optimize the centrality of their location within each group 4) Points are moved closer to the closest center-point 5) The entire process is iterated until the sum of the distances from their centers is minimized 19

Unsupervised Learning: Hierarchical Clustering Used where Feature-sets have No Labels (the answers aren t known) Groups Observations by Association using Like Attributes Documents or Emails by Topic: Detect cyber-crime, perform research People by their Behaviors: Categorize Customers, Identify Criminals Entities by their Properties: Animal, Vegetable or Mineral? Can be used to Identify Which Attributes are the Most Influential Appropriate for Separating Linear Clusters 1) Places each observation into its own cluster 2) Merge the clusters having the closest two points 3) Continue until all clusters belong to closest groups 4) You Must Stop before it becomes a single cluster 20

Demonstration K-Means Clustering: Develop a ML Experiment using Jupyter Notebooks with the RevoScalePy library s Remote Execution API s Operationalize a Model that Clusters Customers by their Purchasing Habits using In-Database Script Execution 21

Unsupervised Learning: Principal Component Analysis Descriptive Analytic Method: Being a form of Unsupervised Learning, it serves to categorize observations in the absence of a (dependent) predictor variable Addresses Non-linear Correlations Summarizes the influence of non-linear relationships Extracts variability from correlated features into powerful components Reduces Dimensionality Simplifies data analysis Transforms high-dimensional data into a smaller collection of more influential components 22

Demonstration Principal Component Analysis (PCA): Perform Dimension Reduction using Principal Component Analysis in Jupyter Notebooks with the RevoScalePy library s Remote Execution API s 23

Questions 24

Resources Microsoft Docs: Tutorials for SQL Server Machine Learning Services Microsoft Machine Learning Server Blog: Basics of R and Python Execution in SQL Server 25