Azure Data Lake Analytics Introduction for SQL Family Julie Koesmarno @MsSQLGirl www.mssqlgirl.com jukoesma@microsoft.com
"What we have is a data glut."
Vernor Vinge (Emeritus Professor of Mathematics at San Diego State University)
The Data Lake Approach
CLOUD MOBILE
Traditional data warehousing approach
[Diagram] Understand Corporate Strategy, then Gather Requirements (Business Requirements, Technical Requirements), then Implement Data Warehouse.
Implementation steps by layer:
  BI and analytics: Reporting & Analytics Design, Reporting & Analytics Development
  Data warehouse: Dimension Modelling, Physical Design
  ETL: ETL Design, ETL Development
  Data sources: Setup Infrastructure, Install and Tune
The Data Lake approach
- Ingest all data regardless of requirements
- Store all data in native format without schema definition
- Do analysis using analytic engines like Hadoop
[Diagram labels: Devices; Batch queries; Interactive queries; Real-time analytics; Machine Learning; Data warehouse]
How Microsoft has used Big Data
MICROSOFT DOUBLES SEARCH SHARE
We needed to better leverage data and analytics to win in search.
We changed our approach: more experiments by more people!
So we:
- Built an Exabyte-scale data lake for everyone to put their data.
- Built tools approachable by any developer.
- Built machine learning tools for collaborating across large experiment models.
[Chart: search share by year, 2009 to 2015; axis 0% to 25%; data labels 9%, 11%, 15%, 16%, 18%, 19%, 20%. Source: ComScore 2009-2015 Search Report US]
Introducing Azure Data Lake Big Data Made Easy
Azure Data Lake as part of Cortana Intelligence Suite
[Diagram: Data, Intelligence, Action]
Data Sources: Apps, Sensors and devices
Information Management: Data Factory, Data Catalog, Event Hubs
Big Data Stores: SQL Data Warehouse, Data Lake Store
Machine Learning and Analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
Intelligence: Cognitive Services, Bot Framework, Cortana
Dashboards & Visualizations: Power BI
People: Apps (Web, Mobile, Bots), Automated Systems
Azure Data Lake Analytics
Why ADLA? Use Cases
- Digital Crime Unit: analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms
- Image Processing: large-scale image feature extraction and classification using custom code
- Shopping Recommendation: complex pattern analysis and prediction over shopping records using proprietary algorithms
Why ADLA?
ADLA:
- Enables customers to leverage existing experience with C#, SQL & PowerShell
- Offers convenience, efficiency, automatic scale, and management in a job service form factor
Azure Data Lake Analytics
- Start in seconds, scale instantly, pay per job
- Develop massively parallel programs with simplicity
- Debug and optimize your Big Data programs with ease
- Virtualize your analytics
- Enterprise-grade security, auditing and support
ADL and SQL / Power BI
Work across all cloud data
Azure Data Lake Analytics works with: Azure SQL DW, Azure SQL DB, Azure Data Lake Store, Azure Storage Blobs, SQL DB in an Azure VM
Tools
Get started
1. Log in to Azure
2. Create an ADLA account
3. Write and submit an ADLA job with U-SQL
4. The job reads and writes data from storage
[Diagram labels: 30 seconds; ADLS; Azure Blobs; Azure DB]
https://github.com/azure/usql
What can you do in the Azure Portal?
- Create a new Data Lake Analytics account
- Author U-SQL scripts
- Submit U-SQL jobs
- Cancel running jobs
- Provision users who can submit jobs
- Visualize usage stats (compute hours)
- Visualize job management chart
ADLA billing
https://blogs.msdn.microsoft.com/azuredatalake/2016/10/12/understanding-adl-analytics-unit/
- Accounts are FREE!
- Pay for the compute resources you want for your queries
- Pay for storage separately
Cost = (query_hours * parallelism) * price/hour
GA price (starting January 1st, 2017)*
  Usage            Price
  ADLAU            $2 / hour
  Completed Job    Free
*Special monthly commitment discounted pricing available
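The billing formula on this slide can be worked through with a few lines of Python. This is a minimal sketch, assuming the $2 / hour ADLAU price quoted above and ignoring any billing granularity, rounding, or commitment discounts; the function name `job_cost` is made up for illustration.

```python
# Sketch of the slide's formula: (query_hours * parallelism) * price/hour.
# ADLAU_PRICE_PER_HOUR is the GA price quoted on the slide; rounding rules
# and commitment discounts are NOT modelled here (assumption).
ADLAU_PRICE_PER_HOUR = 2.00  # USD per ADLAU per hour


def job_cost(duration_minutes, adlaus, price_per_hour=ADLAU_PRICE_PER_HOUR):
    """Return the estimated compute cost in USD for one job."""
    query_hours = duration_minutes / 60
    return query_hours * adlaus * price_per_hour


# One ADLAU for one hour costs the hourly price.
print(job_cost(60, 1))
# Ten ADLAUs for ten minutes is 100 ADLAU minutes.
print(round(job_cost(10, 10), 2))
```

Storage is billed separately, so it is not part of this estimate.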
ADLAU allocation
Example: allocating 10 ADLAUs for a 10 minute job
Cost: 10 min * 10 ADLAUs = 100 ADLAU minutes
[Charts: ADLAUs over time. Blue line: allocated. Red line: running.]
- You are paying for the area under the blue line
- You are only using the area under the red line
- Over-allocation: consider using fewer ADLAUs
- Under-allocation: consider using more ADLAUs
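The "area under the line" comparison can be made concrete with a small calculation. The sketch below assumes a hypothetical per-minute profile of running ADLAUs (the red line) against a constant allocation (the blue line); the profile values and the function name `allocation_summary` are invented for illustration and do not come from a real job.

```python
def allocation_summary(allocated, running_per_minute):
    """Compare allocated ADLAU minutes (paid for) with ADLAU minutes in use."""
    if not running_per_minute:
        raise ValueError("running_per_minute must contain at least one sample")
    minutes = len(running_per_minute)
    allocated_minutes = allocated * minutes   # area under the blue line
    used_minutes = sum(running_per_minute)    # area under the red line
    return {
        "allocated_adlau_minutes": allocated_minutes,
        "used_adlau_minutes": used_minutes,
        "utilization": used_minutes / allocated_minutes,
        "peak_running": max(running_per_minute),
    }


# Hypothetical 10 minute job with 10 ADLAUs allocated.
profile = [2, 6, 10, 10, 8, 4, 3, 2, 1, 1]
summary = allocation_summary(10, profile)
print(summary)
```

A low utilization with a peak well below the allocation points to over-allocation; a profile that sits at the allocation for most of the job suggests that more ADLAUs might shorten it.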
What can you do with Visual Studio?
- Author U-SQL scripts (with C# code)
- Debug U-SQL and C# code
- Submit and cancel U-SQL jobs
- Visualize physical plan of U-SQL query
- Visualize and replay progress of job
- Fine-tune query performance
- Create metadata objects
- Browse metadata catalog
How to get going with ADL Tools for Visual Studio Plug-in
Metadata objects
- ADL Analytics creates and stores a set of metadata objects in a catalog maintained by a metadata service
- Tables and TVFs are created by DDL statements (for example, CREATE TABLE)
- Metadata objects can be created directly through the Server Explorer
[Server Explorer tree: Azure Data Lake Analytics account; Databases; Tables; Table valued functions; Jobs; Schemas; Linked storage]
Metadata catalog
The metadata catalog can be browsed with the Visual Studio Server Explorer.
Server Explorer lets you:
1. Create new tables, schemas and databases
2. Register assemblies
Metadata Object Model
[Diagram] ADLA Account/Catalog contains [1,n] Database; Database contains [1,n] Schema.
Other database-level nodes: Credentials, Data Source, C# Assemblies
C# objects: C# Fns, C# UDTs, C# UDAgg, C# Extractors, C# Reducers, C# Processors, C# Applier, C# Combiners, C# Outputters
Schema objects [0,n]: Ext. tables, tables, views, TVFs, Procedures, Table Types
Table-related nodes: Statistics, Clustered Index, partitions
Legend: User objects; Contains; MD Name; Refers to; C# Name; Implemented and named by
I U-SQL
Status Quo: SQL for Big Data
- Declarativity does scaling and parallelization for you
- Extensibility is bolted on and not native
  - hard to work with anything other than structured data
  - difficult to extend with custom code
Status Quo: Programming Languages for Big Data
- Extensibility through custom code is native
- Declarativity is bolted on and not native
  - User often has to care about scale and performance
  - SQL is 2nd class, embedded within strings
  - Often no code reuse / sharing across queries
Why U-SQL
Declarativity and Extensibility are equally native to the language! Get benefits of both!
Makes it easy for you by unifying:
- Unstructured and structured data processing
- Declarative SQL and custom imperative code (C#)
- Local and remote queries
Increase productivity and agility from Day 1 and at Day 100 for YOU!
The Origins of U-SQL
Next generation large-scale data processing language combining:
- The declarativity, optimizability and parallelizability of SQL
- The extensibility, expressiveness and familiarity of C#
[Diagram labels: U-SQL; SCOPE; T-SQL; Hive]
High performance. Scalable. Affordable. Easy to program. Secure.
Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
- Avoid moving large amounts of data across the network between stores
- Single view of data irrespective of physical location
- Minimize data proliferation issues caused by maintaining multiple copies
- Single query language for all data
- Each data store maintains its own sovereignty
- Design choices based on the need
- Push SQL expressions to remote SQL sources: projections, filters, joins
[Diagram: a U-SQL query in Azure Data Lake Analytics queries Azure Data Lake Storage, Azure Storage Blobs, Azure SQL Data Warehouse, Azure SQL DB and Azure SQL in VMs]
U-SQL Language Philosophy
Declarative Query and Transformation Language:
- Uses SQL's SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions
- Optimizable, Scalable
Expression-flow programming style:
- Easy to use functional lambda composition
- Composable, globally optimizable
Operates on Unstructured & Structured Data:
- Schema on read over files
- Relational metadata objects (e.g. database, table)
Extensible from ground up:
- Type system is based on C#
- Expression language IS C#
- User-defined functions (U-SQL and C#)
- User-defined Aggregators (C#)
- User-defined Operators (UDO) (C#)

    // Make the C# code in a registered assembly available to the script.
    REFERENCE ASSEMBLY MyDB.MyAssembly;

    CREATE TABLE T(
        cid int,
        first_order DateTime,
        last_order DateTime,
        order_count int,
        order_amount float,... );

    // Schema on read: apply a schema to the files while extracting.
    @o = EXTRACT oid int, cid int, odate DateTime, amount float
         FROM "/input/orders.txt"
         USING Extractors.Csv();

    @c = EXTRACT cid int, name string, city string
         FROM "/input/customers.txt"
         USING Extractors.Csv();

    // Expressions are C#: ==, &&, string methods, custom functions and aggregators.
    @j = SELECT c.cid,
                MIN(o.odate) AS firstorder,
                MAX(o.odate) AS lastorder,
                COUNT(o.oid) AS ordercnt,
                AGG<MyAgg.MySum>(o.amount) AS totalamount
         FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
         WHERE c.city.StartsWith("new") && MyNamespace.MyFunction(o.odate) > 10
         GROUP BY c.cid;

    // Write the result with a custom outputter, then load it into the table.
    OUTPUT @j TO "/output/result.txt" USING new MyData.Write();

    INSERT INTO T SELECT * FROM @j;

U-SQL provides the parallelization and scale-out framework for user code: EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
Federated query across distributed data sources
U-SQL compilation process
[Diagram] The compiler & optimizer works with the U-SQL metadata service.
Compilation output (in job folder), deployed to vertices:
- C# managed dll
- C++ unmanaged dll
- algebra
- other files (system files, deployed resources)
Query Life
[Diagram labels: Visual Studio; Portal / API; Front-End Service; Job Scheduler & Queue; Compiler; Optimizer; Runtime; Vertex Scheduling]
Jobs: states, queue, priority
Job execution graph
- After a job is submitted, its progress through the different stages of execution is shown and updated continuously
- Important stats about the job are also displayed and updated continuously
Visual Studio: Job states
UX state     Job state                               Meaning
Preparing    New, Compiling                          The script is being compiled by the Compiler Service
Queued       Queued                                  All jobs enter the queue.
Queued       Scheduling                              Are there enough ADLAUs to start the job?
Queued       Starting                                If yes, then allocate those ADLAUs for the job
Running      Running, Finalizing                     The U-SQL runtime is now executing the code on 1 or more ADLAUs or finalizing the outputs
Ended        Ended (Succeeded, Failed, Cancelled)    The job has concluded.
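The relationship between the detailed job states and the coarser states shown in Visual Studio can be sketched as a lookup table. This is only an illustration: the grouping is read from the slide layout, the strictly linear progression is an assumption (a job that fails or is cancelled presumably ends early), and the names `JobState`, `UX_STATE` and `next_state` are hypothetical rather than part of any ADLA SDK.

```python
from enum import Enum


class JobState(Enum):
    """Detailed job states, in the order the slide lists them."""
    NEW = "New"
    COMPILING = "Compiling"
    QUEUED = "Queued"
    SCHEDULING = "Scheduling"
    STARTING = "Starting"
    RUNNING = "Running"
    FINALIZING = "Finalizing"
    ENDED = "Ended"


# Coarser state shown in the Visual Studio UX for each detailed state.
UX_STATE = {
    JobState.NEW: "Preparing",
    JobState.COMPILING: "Preparing",
    JobState.QUEUED: "Queued",
    JobState.SCHEDULING: "Queued",
    JobState.STARTING: "Queued",
    JobState.RUNNING: "Running",
    JobState.FINALIZING: "Running",
    JobState.ENDED: "Ended",
}

HAPPY_PATH = list(JobState)  # Enum iteration follows declaration order


def next_state(state):
    """Return the following state on the happy path, or None once Ended."""
    i = HAPPY_PATH.index(state)
    return HAPPY_PATH[i + 1] if i + 1 < len(HAPPY_PATH) else None


state = JobState.NEW
while state is not None:
    print(f"{state.value:<11} shown as {UX_STATE[state]}")
    state = next_state(state)
```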
Why does a job get queued?
Local cause
- Possible condition: not enough containers available to your account
Global cause (very rare)
- Possible conditions: system-wide shortage of containers; system-wide shortage of bandwidth
ADLA & U-SQL Summary
This is why ADL & U-SQL!
- Easily processes big data of unknown value
- Natively unifies SQL's declarativity and C#'s extensibility
- Enterprise-grade security and auditing support
- Increases productivity and agility from Day 1 forward for YOU!
Sign up for an Azure Data Lake account at http://www.azure.com/datalake and give us your feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
Additional Resources
Blogs and community page:
- http://usql.io (U-SQL Github)
- http://blogs.msdn.microsoft.com/mrys/
- http://blogs.msdn.microsoft.com/azuredatalake/
- https://channel9.msdn.com/search?term=u-SQL#ch9Search
Documentation and articles:
- http://aka.ms/usql_reference
- https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
- https://msdn.microsoft.com/en-us/magazine/mt614251
ADL forums and feedback:
- http://aka.ms/adlfeedback
- https://social.msdn.microsoft.com/forums/azure/en-US/home?forum=AzureDataLake
- http://stackoverflow.com/questions/tagged/u-sql
Get started today! For more information visit: http://azure.com/datalake
Thank You Redmond!