Modern ETL Tools for Cloud and Big Data. Ken Beutler, Principal Product Manager, Progress; Michael Rainey, Technical Advisor, Gluent Inc.


Agenda
- Landscape
- Cloud ETL Tools
- Big Data ETL Tools
- Best Practices
- Q&A

Landscape

Data Warehousing Has Transformed Enterprise BI and Analytics

ETL Tools Are At The Heart Of The EDW

Rise Of The Modern EDW Solutions: Cloud Data Warehouse and Big Data Warehouse

Enterprises Are Increasingly Adopting Cloud and Big Data
[Survey chart: adoption rates across three categories.
- Cloud platforms: Amazon Web Services, Microsoft Azure, VMware, Google Cloud, Salesforce, Oracle, IBM, SAP HANA, Rackspace, RedHat OpenStack, RedHat OpenShift, Digital Ocean, CenturyLink, Linode
- Big data platforms: Amazon S3, Hadoop Hive, Spark SQL, Hortonworks, Amazon EMR, Apache Solr, Cloudera CDH, Oracle BDA, Apache Sqoop, IBM BigInsights, SAP Cloud Platform Big Data, Cloudera Impala, MapR, Azure HDInsight, Apache Storm, Google Cloud DataProc, Apache Drill, Apache Phoenix, Presto, Pivotal HD GemFireXD
- NoSQL stores: MongoDB, Cassandra, Redis, Oracle NoSQL, Google Cloud HBase, DynamoDB, Couchbase, Google Cloud DocumentDB, SimpleDB, MarkLogic, DataStax Enterprise, Aerospike, Riak, Scylla, ArangoDB]
Source: Progress DataDirect's Data Connectivity Outlook Survey 2018

Cloud ETL Tools

Poll Question 1: Which cloud provider are you currently using / plan to use?
- AWS
- Google Cloud Platform
- Azure
- Other
- None

AWS Glue: Serverless ETL

AWS Glue components:
- Data Catalog
- Crawlers
- ETL jobs/scripts
- Job scheduler
Useful for:
- Running serverless queries against S3 buckets and relational data
- Creating event-driven ETL pipelines
- Automatically discovering and cataloging your enterprise data assets

AWS Glue architecture. Source: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html

Data Catalog: the central component
- Can act as a metadata repository for other Amazon services
- Tables are added to databases using the wizard or a crawler
- Data sources: Amazon S3, Redshift, Aurora, Oracle, PostgreSQL, MySQL, MariaDB, MS SQL Server, JDBC, DynamoDB
- Crawlers connect to one or more data stores, determine the data structures, and write tables into the Data Catalog

Jobs
- PySpark or Scala scripts, generated by AWS Glue
- A visual dataflow can be generated, but is not used for development
- Execute ETL via the job scheduler, events, or manual invocation
- Built-in transforms used to process data:
  - ApplyMapping: maps source columns to target columns
  - Filter: loads a new DynamicFrame based on filtered records
  - Join: joins two DynamicFrames
  - SelectFields: outputs selected fields to a new DynamicFrame
  - SplitRows: splits rows into two new DynamicFrames based on a predicate
  - SplitFields: splits fields into two new DynamicFrames
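To make the behavior of these built-in transforms concrete, here is a minimal stdlib-Python sketch of their semantics. This is not the awsglue API; it models a DynamicFrame as a plain list of dicts, and all field and record names are hypothetical.

```python
# Conceptual sketch of Glue's built-in transforms (NOT the awsglue API).
# A "frame" here is just a list of dict rows.

def apply_mapping(rows, mapping):
    """ApplyMapping: rename source columns to target columns."""
    return [{tgt: row[src] for src, tgt in mapping if src in row} for row in rows]

def filter_rows(rows, predicate):
    """Filter: build a new frame from rows matching the predicate."""
    return [row for row in rows if predicate(row)]

def split_rows(rows, predicate):
    """SplitRows: two new frames, split by a predicate."""
    matched = [r for r in rows if predicate(r)]
    rest = [r for r in rows if not predicate(r)]
    return matched, rest

# Hypothetical source data
source = [
    {"emp_id": 1, "emp_name": "Ada", "dept": "eng"},
    {"emp_id": 2, "emp_name": "Lin", "dept": "sales"},
]

mapped = apply_mapping(source, [("emp_id", "id"), ("emp_name", "name")])
sales = filter_rows(source, lambda r: r["dept"] == "sales")
eng, other = split_rows(source, lambda r: r["dept"] == "eng")
```

In a real Glue job these operations run on distributed DynamicFrames inside the generated PySpark script; the sketch only illustrates what each transform does to rows.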

Azure Data Factory: Visual Cloud ETL

Azure Data Factory
- Build data pipelines using a visual ETL user interface
- Visual Studio Team Services (VSTS) Git integration for collaboration, source control, and versioning
- Drag, drop, and link activities:
  - Copy Data: source to target
  - Transform: Spark, Hive, Pig, streaming on HDInsight, Stored Procedures, ML Batch Execution, etc.
  - Control flow: If-then-else, For-each, Lookup, etc.
Source: https://azure.microsoft.com/en-us/blog/continuous-integration-and-deployment-using-data-factory/
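Behind the visual UI, Data Factory pipelines are authored as JSON documents. The following sketch (expressed as a Python dict for readability) follows the general shape of a pipeline with a single Copy activity; the pipeline, activity, and dataset names are made up for illustration.

```python
# Hypothetical sketch of an Azure Data Factory pipeline definition.
# ADF stores pipelines as JSON; the dataset and linked-service names
# below are invented for illustration.
import json

pipeline = {
    "name": "CopyBlobToSql",
    "properties": {
        "activities": [
            {
                "name": "CopyData",
                "type": "Copy",
                "inputs": [{"referenceName": "BlobDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SqlDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ]
    },
}

# The JSON form is what Git integration versions alongside your code.
print(json.dumps(pipeline, indent=2))
```

Dragging and linking activities in the UI edits exactly this kind of document, which is why VSTS Git integration gives full source control over pipelines.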

Linked services allow connection to many data sources
- Sources: Azure DBs, Azure Blob Storage, Azure File Storage, Oracle, SQL Server, SAP HANA, Sybase, DB2, Impala, Drill, MySQL, Google BigQuery, Cassandra, Amazon S3, HDFS, Salesforce, etc.
- Targets: Azure DBs, Azure Blob Storage, Azure File Storage, Oracle, SQL Server, SAP HANA, File System, Generic ODBC, Salesforce, etc.

Integration runtime environment
- The compute environment required for Azure Data Factory
- Fully managed in Azure cloud infrastructure: automatic scaling, high availability, etc.
- Allows execution of SSIS packages
- Can be hosted on a local private network, so organizations can control and manage compute resources

Google Cloud Dataflow: Unified Programming Model for Data Processing

Google Cloud Dataflow. Source: https://cloud.google.com/dataflow/

Google Cloud Dataflow overview
- A unified programming model for batch (historical) and streaming (real-time) data pipelines, plus a distributed compute platform
- Reuse code across both batch and streaming pipelines
- Java- or Python-based
- The programming model is an open source project, Apache Beam (https://beam.apache.org/), which runs on multiple distributed processing back-ends: Spark, Flink, and the Cloud Dataflow platform
- Fully managed service with automated resource management and scale-out
Source: https://cloud.google.com/dataflow/
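The key idea, one transform definition reused for bounded (batch) and unbounded (streaming) inputs, can be sketched in plain Python. This is a conceptual analogue only, not the apache_beam API, and the record format is invented for illustration.

```python
# Conceptual analogue of Beam's unified model (NOT the apache_beam API):
# the same transform chain runs over a bounded source (batch) or an
# incrementally arriving source (streaming) without code changes.

def parse(record):
    """Parse a hypothetical 'user,amount' record."""
    user, amount = record.split(",")
    return user, int(amount)

def pipeline(source):
    """One pipeline definition, reused for batch and streaming inputs."""
    for record in source:
        user, amount = parse(record)
        if amount > 0:          # filter transform
            yield user, amount  # emit downstream

batch_input = ["ada,5", "lin,-1", "ada,3"]   # bounded (historical) data
stream_input = iter(["bob,2", "ada,4"])      # arrives incrementally

batch_out = list(pipeline(batch_input))
stream_out = list(pipeline(stream_input))
```

In real Beam, the same `PTransform` graph is handed to a runner (Dataflow, Spark, or Flink), which decides how to execute it over bounded or unbounded `PCollection`s.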

Dataflow I/O transforms
Beam Java:
- File-based: supports Apache HDFS, Amazon S3, Google Cloud Storage, and local filesystems (FileIO, AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO, ParquetIO); many different filesystems and cloud object stores supported
- Messaging: Amazon Kinesis, AMQP, Apache Kafka, Google Cloud Pub/Sub, JMS, MQTT
- Database: Apache Cassandra, Apache Hadoop InputFormat, Apache HBase, Apache Hive (HCatalog), Apache Solr, JDBC, Elasticsearch, Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, MongoDB, Redis
Beam Python:
- File-based: supports Google Cloud Storage and local filesystems (avroio, textio, tfrecordio, vcfio)
- Messaging: Google Cloud Pub/Sub
- Database: Google BigQuery, Google Cloud Datastore
Notes: JDBC allows custom connections to additional databases; other I/O transforms are already under development.

Cloud ETL tools compared

                            AWS Glue   Google Dataflow   Azure Data Factory
Batch ETL                   X          X                 X
Streaming                   -          X                 -
User Interface              -          -*                X
Compute data platform       -          X                 -
Cross-platform support      X          X                 X
Custom connector support    X          X                 X
Metadata catalog            X          -                 -
Monitoring tools available  X          X                 X
Fully managed               X          X                 X

* via Cloud Dataprep

Big Data ETL Tools

Poll Question 2: Which of the following tools are you currently using / plan to use?
- Apache Sqoop
- Apache NiFi
- Apache Flume
- Other
- None

Apache Sqoop: Batch Data Transfer

Sqoop Overview
- Sqoop is a tool designed to transfer data between Hadoop and RDBMSs or mainframes (bi-directional)
- Open source (Apache) command-line tool
- Batch/bulk transfer of structured data
- Requires Hadoop; leverages MapReduce to provide parallel execution and data partitioning

Sqoop Details
Importing data:
- Bulk move of data from an RDBMS or mainframe to Hadoop
- Persists in HDFS, Hive, HBase, or Accumulo
- Uses JDBC* for connectivity
- Able to perform column and row filtering or free-form queries
- Further processing must use MapReduce, Pig, or another scripting language
Exporting data:
- Exports a set of files in HDFS to an RDBMS
- Uses JDBC* for connectivity
- Leverages either update or insert mode, or incremental imports
- Limited transformation support (column filtering)
- Constraints: the target table must exist; the operation is not atomic
* A very limited number of databases are supported, and support is version-specific

Sqoop Details (continued)
- Sqoop jobs can be created and saved to be used as templates
- Sqoop jobs can be scheduled using Oozie
- sqoop-merge can be used to combine two datasets into one, overwriting values of the older dataset
- Plugin architecture
Example:
$ sqoop import --connect jdbc:mysql://localhost/userdb --username root --table emp --m 1
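As a counterpart to the import command above, a Sqoop export pushes HDFS files back into an RDBMS. A minimal sketch follows; the database, table, and HDFS path are hypothetical, and (per the constraints noted earlier) the target table must already exist.

```shell
# Hypothetical example: export the HDFS directory /user/hadoop/emp
# back into the MySQL table emp (table must already exist).
sqoop export \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table emp \
  --export-dir /user/hadoop/emp \
  --input-fields-terminated-by ','
```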

Apache Flume: Log/Event collection, aggregation and transfer

Flume Overview
- Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of streaming data into a backing store
- Open source (Apache) command-line tool
- Transfers high-throughput, low-latency, structured data: application logs, sensor, machine, geo-location, and social data
- Does not require Hadoop, but has many integrations with services that can leverage Hadoop effectively

Flume Details: ingesting data
Data flow model:
- Event: a singular unit of data that is transported
- Source: the entity through which data enters Flume; sources either poll for data or passively wait for data
- Sink: delivers data to the destination; a variety of sinks allow data to be streamed to a variety of destinations
- Channel: the conduit between the source and the sink; sources ingest events into the channel and sinks drain the channel
- Agent: a collection of sources, sinks, and channels running in a JVM
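These components are wired together in an agent's properties file. The following minimal configuration, patterned after Flume's canonical netcat-to-logger example, defines one source, one memory channel, and one sink on an agent named a1:

```properties
# Agent a1: netcat source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: writes events to the log (useful for testing)
a1.sinks.k1.type = logger

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Swapping the sink type (for example to an HDFS sink) redirects the same flow to a different backing store without changing the source.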

Apache NiFi: Process, track and distribute data

NiFi Overview
A platform built to automate and manage the flow of data between systems:
- Visual command and control
- Data provenance
- Data prioritization
- Data buffering / back-pressure
- Control of latency vs. throughput
- Security
- Scale-out clustering
- Extensibility
When to use NiFi? When not to use NiFi?

NiFi Data Flows: components of a data flow
- FlowFile
- FlowFile Processor
- Connection
- Flow Controller
- Process Group

NiFi Data Provenance
- Records the provenance of data as it flows through the system
- Searching, filtering, and drill-in capabilities
- Event types: receive, send, clone, drop, fork, route, modify, join

Comparison of Big Data ETL Tools

Sqoop
- Purpose: command-line tool for bidirectional bulk data movement
- Limitations: OOTB connectivity does not support many versions or sources; can put strain on the data source as the MapReduce jobs partition the data
- Use cases: extract operational data from an RDBMS for processing in Hadoop; archive historical or expired data to Hadoop

Flume
- Purpose: command-line tool for collecting, aggregating, and moving event-based data
- Limitations: message delivery is not guaranteed in some instances; the channel backing store can limit scalability
- Use cases: detect and report on web traffic anomalies in near real time; monitor global network traffic for threat detection and fault tolerance

NiFi
- Purpose: command-and-control framework for modeling, monitoring, and moving data between systems
- Limitations: may have scalability issues with extreme-volume / low-latency data flows
- Use cases: everything above, but with greater flexibility in terms of developer and user experience

Poll Question 3: What data sources are most challenging to connect to from your ETL tool?
- SaaS sources (Salesforce, Eloqua, Oracle Service Cloud, Google Analytics, etc.)
- Relational sources (Oracle, SQL Server, IBM DB2, etc.)
- Data warehouses (RedShift, Teradata, Sybase, Greenplum, etc.)
- Big data sources (Hadoop, MongoDB, Cassandra, etc.)
- Other

EDW/ETL Best Practices

EDW/ETL Best Practices
Simplify access:
- Self service
- Data breadth
- Tool training
Data governance:
- Create a data governance board
- Appoint data stewards
Data quality:
- Build a DQ firewall
- DQ assessment and DQ measurement
- Incorporate DQ into functions/processes
- Let the business drive DQ
Data modeling:
- Schema on read and schema on write
- Normalized and denormalized schemas
- Logical and physical data models

Resources
Tutorials:
- AWS Glue + Salesforce JDBC Driver
- Google Dataflow + Salesforce into Google BigQuery
- Apache Sqoop: EXPORT to SQL Server from Hadoop
- Apache NiFi: Ingest Salesforce Data Incrementally into Hive
- Progress DataDirect: JDBC and ODBC Connectors
- Why DataDirect JDBC Drivers?
- Gluent: Integrate Enterprise Data with Gluent Cloud Sync and AWS Glue

Q&A