SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

Big Data and Hadoop with Azure HDInsight
Andrew Brust
Senior Director, Technical Product Marketing and Evangelism, Datameer
Level: Intermediate

Meet Andrew
- Senior Director, Technical Product Marketing and Evangelism
- Big Data blogger for ZDNet
- Microsoft Regional Director, MVP
- Co-chair, Visual Studio Live!
- Redmond Review columnist for Visual Studio Magazine
- Twitter: @andrewbrust

Andrew's New/Old Blog (bit.ly/bigondata)
- Read all about it!

What is Big Data?
- 100s of TB into PB and higher
- Involving data from: financial data, sensors, web logs, social media, etc.
- Parallel processing often involved
- Hadoop is emblematic, but other technologies are Big Data too
- Processing of data sets too large for transactional databases
- Analyzing interactions, rather than transactions
- The three V's: Volume, Velocity, Variety
- Big Data tech sometimes imposed on small data problems

What's MapReduce?
- Big data input accepted in file form
- Data is partitioned and sent to mappers (nodes in cluster)
- Mappers pre-process data into key-value (KV) pairs, then all output for (a) given key(s) goes to a reducer
- Reducers aggregate; one line of output per unique key, with one value
- Map and Reduce code natively written as Java functions
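
The MapReduce flow described above can be sketched in a few lines of Python. This is only an in-memory illustration of the partition, map, shuffle and reduce steps using a word count; the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the sample input are made up for this sketch, and real Hadoop jobs are natively written in Java and run across a cluster.

```python
from collections import defaultdict


def mapper(line):
    """Map step: turn one input line into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, so all output for a given key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: aggregate to a single value per unique key."""
    return key, sum(values)


def run_job(lines, num_mappers=2):
    # Partition the input across the (simulated) mappers.
    partitions = [lines[i::num_mappers] for i in range(num_mappers)]
    mapped = [pair for part in partitions for line in part for pair in mapper(line)]
    grouped = shuffle(mapped)
    # One output entry per unique key, each with one value.
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


counts = run_job(["big data is big", "hadoop is emblematic", "data data"])
print(counts)
```

The result has one entry per unique word, which mirrors the "one line of output per unique key" point on the slide.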

MapReduce, in a Diagram
[Diagram: input is split across several mappers; mapper output is grouped by key (K1, K2, K3) and each key's values are sent to one reducer]

HDFS
- File system whose data gets distributed over commodity drives on commodity servers
- Data is replicated
- If one box goes down, no data lost
- Shared Nothing
  - Except the name node
- BUT: Immutable
  - Files can only be written to once
  - So updates require drop + re-write (slow)
  - You can append though
  - Like a DVD/CD-ROM
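
To make the replication point on the HDFS slide concrete, here is a toy Python model. It assumes a replication factor of 3 (the HDFS default) and a simple round-robin placement, which is a stand-in for illustration and not HDFS's actual rack-aware placement policy. The node names and helper functions are hypothetical.

```python
REPLICATION = 3  # HDFS default replication factor


def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin, illustrative only)."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            nodes[(block + r) % len(nodes)] for r in range(replication)
        }
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {block for block, holders in placement.items() if holders - failed}


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(num_blocks=8, nodes=nodes)

# If one box goes down, every block is still readable from another replica.
survivors = readable_blocks(placement, failed={"node2"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

In this toy model a block only becomes unreadable when every node holding one of its replicas has failed.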

HBase
- A Wide-Column Store, NoSQL database
- Modeled after Google BigTable
- HBase tables are HDFS files
  - Therefore, Hadoop-compatible
- Hadoop often used with HBase
  - But you can use either without the other
- HBase now available on HDInsight
  - Implemented as different cluster type

The Hadoop Stack
- Log file integration
- Machine Learning/Data Mining
- RDBMS Import/Export
- Query: HiveQL and Pig Latin
- Database
- MapReduce, HDFS

Hadoop Distributions
- Cloudera (CDH)
- MapR
  - Network File System replaces HDFS
- Hortonworks (HDP)
  - Open Data Platform (ODP)
- Pivotal HD
  - Greenplum IP; full dev stack
- IBM InfoSphere BigInsights
  - HDFS<->DB2 integration
- And Microsoft

Microsoft HDInsight
- Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
- Windows Azure HDInsight and Microsoft HDInsight Server
  - Single node preview runs on Windows client
- Also Hortonworks HDP for Windows
- Also HDInsight with Analytics Platform System
- Includes ODBC Drivers for Hive
- All contributed back to open source Apache project

HDInsight Recent Changes
- YARN and Tez now available
  - MapReduce no longer mandatory
- Access via PowerShell and HDInsight cmdlets
  - Need to install PowerShell for Microsoft Azure and HDInsight
- Web GUI
  - For Hive queries and job monitoring

Azure HDInsight Provisioning
- Go to Windows Azure portal
- Select HDInsight from left navbar
- Click + NEW button at lower left
- Specify cluster name, number of nodes, admin password, storage account
  - Credentials used for ODBC
- Optionally, enable RDP access to head node, with credentials
- Click CREATE HDINSIGHT CLUSTER
- Wait for provisioning to complete
- Use PowerShell or RDP into clustername.azurehdinsight.net

Azure HDInsight Provisioning

Working with HDInsight
- Web GUI
  - For Hive queries and job monitoring
- Access via PowerShell and HDInsight cmdlets
  - Need to install PowerShell for Microsoft Azure and HDInsight
- RDP into head node
  - To clustername.azurehdinsight.net
  - Work from (remote) Windows command prompt

Submitting, Running and Monitoring Jobs
- Upload a JAR
- Use Streaming
  - Use other languages (i.e. other than Java) to write MapReduce code
  - Python is a popular option
  - Any executable works, even C# console apps
  - On HDInsight, JavaScript works too
  - Still uses a JAR file: streaming.jar
- Run at command line (PowerShell or Command window via RDP) passing JAR name and params

Amenities for Visual Studio/.NET
- .NET SDK for Hadoop
- Visual Studio Hadoop Tools for HDInsight
- HDInsight PowerShell Cmdlets
- Hortonworks Data Platform for Windows
- LINQ to Hive
- OdbcClient + Hive ODBC Driver
- HDInsight Emulator
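
As an illustration of the Streaming option mentioned above, here is a minimal Python sketch of a word-count mapper and reducer. Hadoop Streaming passes records to the executables as lines on standard input and reads tab-separated key/value lines from standard output, and the reducer receives its input sorted by key. The function names and sample lines are hypothetical, and the `sorted()` call below merely stands in for Hadoop's sort/shuffle phase so the sketch can run locally.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation; sorted() stands in for Hadoop's sort/shuffle phase.
sample = ["Big data is big", "Hadoop is emblematic"]
shuffled = sorted(map_lines(sample))
result = list(reduce_lines(shuffled))
print("\n".join(result))
```

In an actual Streaming job, each function would live in its own script that iterates over `sys.stdin` and prints its output, and the scripts would be passed as the mapper and reducer when launching the job with the streaming JAR.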

Running MapReduce Jobs

Hive
- Used by most BI products which connect to Hadoop
- Provides a SQL-like abstraction over Hadoop
  - Officially HiveQL, or HQL
- Works on own tables, but also on HBase
- Query generates MapReduce job, output of which becomes result set
- Microsoft has Hive ODBC driver
  - Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
- New! SET hive.execution.engine=tez;

Hive

The Data-Refinery Idea
- Use Hadoop to on-board unstructured data, then extract manageable subsets
- Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
- This is the current rationalization of Hadoop + BI tools coexistence
- Will it stay this way?

HDInsight on Linux
- Currently in preview
- Allows interaction with full ecosystem of Linux-based Hadoop tools
- Connect via Apache Ambari or SSH/PuTTY
- Still maps HDFS onto Azure storage

HDInsight Script Steps
- Allows automated customization of an HDInsight cluster
- Uses PowerShell to install additional components
- Pre-built scripts exist for components like Apache Spark
  - But you can write your own
- AWS Elastic MapReduce has a very similar feature

Just-in-time Schema
- When looking at unstructured data, schema is imposed at query time
- Schema is context specific
  - If scanning a book, are the values words, lines, or pages?
  - Are notes a single field, or is each word a value?
  - Are date and time two fields or one?
  - Are street, city, state, zip separate or one value?
- Pig and Hive let you determine this at query time
  - So does the Map function in MapReduce code

How Does MS BI Fit In?
- Excel, PowerPivot: can query via Hive ODBC driver
- Analysis Services (SSAS) Tabular Mode
  - Also compatible with Hive ODBC Driver
  - Multidimensional mode is not
- Power Query
  - Connectivity to HDInsight and standard HDFS
- Power View
  - Works against PowerPivot and SSAS Tabular
- RDBMS + Parallel Data Warehouse (PDW)
  - Sqoop connectors
  - Columnstore Indexes
    - Enterprise Edition and PDW only
- APS/PDW: PolyBase and HDInsight Region
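
The just-in-time schema idea from the earlier slide can be illustrated with a short Python sketch: the same raw lines are read under two different schemas, each chosen at read time. The sample records, the `|` delimiter and the function names are all invented for this illustration.

```python
from datetime import datetime

# Invented sample records: timestamp | address | free-text note
RAW = [
    "2015-03-16 09:30:00|1 Main St, Springfield, IL, 62701|late delivery",
    "2015-03-17 14:05:00|9 Elm Ave, Chicago, IL, 60601|great service today",
]


def as_coarse(line):
    """Schema A: date+time is one field, the address is one value, the note is one field."""
    stamp, address, note = line.split("|")
    return {
        "timestamp": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        "address": address,
        "note": note,
    }


def as_fine(line):
    """Schema B: date and time are two fields, address parts are separate, each word is a value."""
    stamp, address, note = line.split("|")
    date, time = stamp.split(" ")
    street, city, state, zip_code = [part.strip() for part in address.split(",")]
    return {
        "date": date, "time": time,
        "street": street, "city": city, "state": state, "zip": zip_code,
        "words": note.split(),
    }


coarse = [as_coarse(line) for line in RAW]
fine = [as_fine(line) for line in RAW]
print(coarse[0]["address"])
print(fine[1]["city"], fine[1]["words"])
```

Nothing about the stored data changes between the two readings; only the interpretation applied at query time does, which is the role the slide assigns to Pig, Hive and the Map function.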

Excel, PowerPivot
- Excel and PowerPivot use the BI Semantic Model (BISM), which can query Hadoop via Hive and its ODBC driver
- Excel also features Power Query, which can query HDFS directly and insert the results into a BISM repository
- Excel BISM accommodates millions of rows through compression. Not petabyte scale, but sufficient to store and analyze output of Hadoop queries.

PowerPivot, SSAS Tabular
- SQL Server Analysis Services Tabular mode is the enterprise server implementation of BISM
- Features partitioning and role-based security
- Can store billions of rows. So even better for Hadoop output analysis.
- Excel-based BISM repositories can be upsized to SSAS Tabular

Querying Hadoop from Microsoft BI

Sqoop
- Name is short for "SQL to Hadoop"
- Essentially a technology for moving data between data warehouses and Hadoop
- Command line utility; allows specification of source/target HDFS file and relational server, database and table
- Sqoop connectors available for SQL Server and PDW
- Sqoop generates MapReduce job to extract data from, or insert data into, HDFS
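
Since Sqoop is driven from the command line, a small Python helper can show roughly what an import invocation looks like. This sketch only builds and prints the argument list and does not run Sqoop. The server name, database, table, user and target directory are placeholders invented for the example, and a real run would also need credentials (for instance via Sqoop's `-P` prompt or `--password-file`).

```python
import shlex


def sqoop_import_command(jdbc_url, table, target_dir, username, num_mappers=4):
    """Build the argument list for a `sqoop import` run (not executed here)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # relational server and database
        "--username", username,
        "--table", table,               # source table
        "--target-dir", target_dir,     # destination directory in HDFS
        "--num-mappers", str(num_mappers),
    ]


# All values below are hypothetical placeholders.
cmd = sqoop_import_command(
    jdbc_url="jdbc:sqlserver://dbhost.example.com:1433;databaseName=Sales",
    table="Orders",
    target_dir="/data/orders",
    username="etl_user",
)
print(shlex.join(cmd))
```

The `--num-mappers` setting reflects the point above that Sqoop generates a MapReduce job: it controls how many parallel map tasks perform the transfer.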

APS/PDW, PolyBase
- Microsoft Analytics Platform System includes SQL Server Parallel Data Warehouse (PDW)
  - Massively Parallel Processing (MPP) data warehouse appliance version of SQL Server
- MPP manages a grid of relational database servers for divide-and-conquer processing of large data sets.
- PDW and SQL Server 2016 include PolyBase, a component which allows them to query data in Hadoop directly.
  - Bypasses MapReduce; addresses data nodes directly and orchestrates parallelism itself
- APS available with or without a built-in, PolyBase-configured HDInsight region

PolyBase Versus Hive, Sqoop
- Hive and Sqoop generate MapReduce jobs, and work in batch mode
- PolyBase addresses HDFS data itself
- This is true SQL over Hadoop.
- Competitors:
  - Cloudera Impala
  - Teradata QueryGrid
  - Pivotal HAWQ

Resources
- Apache Hadoop home page: http://hadoop.apache.org/
- Hive home page: http://hive.apache.org/
- Azure HDInsight: http://bit.ly/azurebigdata
- Microsoft Big Data: http://bit.ly/sql2012bigdata
- Analytics Platform System: http://bit.ly/msapspdw

Thank You!
- Email: andrew.brust@bluebadgeinsights.com
- Twitter: @andrewbrust