AWS Serverless Architecture Think Big

MAKING BIG DATA COME ALIVE. AWS Serverless Architecture, Think Big. Garrett Holbrook, Data Engineer. Feb 1st, 2017.

Agenda: What is Think Big?; Example Project Walkthrough; AWS Serverless.

Think Big, a Teradata Company. Big data consulting: roadmaps, training, strategy & architecture, implementation. Acquired by Teradata in 2014. Open source.

About Us. Garrett Holbrook: graduated from Neumont University with a BS in CS, with Think Big ~1 year. Mike Forsyth: graduated from BYU with a BS in Computer Engineering, with Think Big since May 2016. Max Goff: Think Big Academy.

Example implementation walkthrough. A company has a lot of data stored in an RDBMS. The RDBMS is costly to manage and is underperforming on certain queries. They hope Hadoop can provide reduced costs and better performance, and they hired Think Big to help.

Next step: evaluate use cases and prioritize, install MapReduce and HDFS on their servers, write some MapReduce jobs. Done?

Not that simple, unfortunately/fortunately. The Hadoop ecosystem has a mind-boggling number of technologies.

Hadoop Ecosystem (ecosystem diagram).

Not that simple, unfortunately/fortunately. The Hadoop ecosystem has a mind-boggling number of technologies. Each of these technologies fulfills some business or technical need. MapReduce is only the tip of the iceberg.

Sqoop. Built for efficiently transferring bulk data between HDFS and relational databases. Source: blogs.apache.org/sqoop

Spark. An engine for general distributed big data processing. It accomplishes the same goal as MapReduce, but does it better, and the Spark API provides functions in addition to map and reduce, as sketched below.
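The following is a minimal PySpark sketch (not from the talk) showing map and reduce alongside the richer operations the API adds, such as reduceByKey and filter; the sample data and app name are made up.

    # Minimal PySpark sketch: map/reduce plus richer operations on the same data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-intro").getOrCreate()
    sc = spark.sparkContext

    events = sc.parallelize([("US", 2), ("JP", 5), ("US", 3)])

    # Classic map/reduce: total event count across all records.
    total = events.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

    # Beyond map and reduce: per-key aggregation and filtering in one chain.
    per_country = (events.reduceByKey(lambda a, b: a + b)
                         .filter(lambda kv: kv[1] > 2)
                         .collect())

    print(total, per_country)
    spark.stop()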

Hive + Tez. Hadoop's data warehouse. SQL is the language of Hive: it turns SQL queries into MapReduce jobs, and newer versions (including the stable release) use Tez for better performance. SQL skills carry over, but it is NOT a relational database despite the use of SQL. A small query sketch follows.
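As a hedged illustration, one way to run Hive-style SQL from Python is through Spark's Hive integration (an alternative client; the talk uses Hive with Tez directly). The table name events_flat and its columns are hypothetical, and a configured Hive metastore is assumed.

    # Sketch: run a Hive-style SQL query from Python via Spark's Hive support.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-query")
             .enableHiveSupport()   # requires a configured Hive metastore
             .getOrCreate())

    # events_flat is a hypothetical Hive table.
    result = spark.sql("""
        SELECT source_country, COUNT(*) AS events
        FROM events_flat
        GROUP BY source_country
    """)
    result.show()
    spark.stop()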

Hue. A web interface for data analysis on Hadoop: a SQL editor for use with Hive, Phoenix, etc., and Spark notebooks. It can be used as the main tool for users to gain access to a Hadoop cluster.

Hue (screenshot). Source: gethue.com

Example Implementation, step 1: Sqoop imports data from the RDBMS into Hadoop (diagram: Relational Database -> Sqoop -> imported files in HDFS). Sample of the imported data:

    event_id   source_location          event_xml
    15234      40.741895,-73.989308     <Header><Event>
    15235      35.689487,139.691706     <Header><Event>

Example Implementation, step 2: Spark flattens the XML data (diagram: imported files -> Flatten -> flattened files). Sample of the flattened data, with a flatten sketch below:

    source_country   event_type_code   event_timestamp
    JP               5                 1484911141
    US               2                 1484914741
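A possible sketch of that flatten step in PySpark, assuming one XML event per line and element names matching the columns above; the HDFS paths are placeholders.

    # Sketch: parse raw XML events and write flat CSV rows (paths and XML layout assumed).
    import xml.etree.ElementTree as ET
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flatten-events").getOrCreate()

    def flatten(xml_string):
        event = ET.fromstring(xml_string)
        return (event.findtext("source_country"),
                int(event.findtext("event_type_code")),
                int(event.findtext("event_timestamp")))

    raw = spark.sparkContext.textFile("hdfs:///data/events_raw/")  # one XML doc per line
    df = spark.createDataFrame(raw.map(flatten),
                               ["source_country", "event_type_code", "event_timestamp"])
    df.write.mode("overwrite").csv("hdfs:///data/events_flat/")
    spark.stop()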

Example Implementation, step 3a: Hive writes a sample to a sample table (diagram: flattened files -> Sample -> ORC files -> sample_table).

Example Implementation, step 3b: Hive runs the SQL query using a distributed and scalable processing engine (diagram: flattened files -> Query -> ORC files -> query_result).

Example Implementation, step 4: Hue for visualization and analysis (diagram: query_result and sample_table -> Hue -> user, analyst, etc.).

Administration. You have a plan and the engineers are ready; now what? Build out the cluster: provision hardware (on-site or cloud?), install Hadoop, Spark, Sqoop, Hue, Hive, and in reality many more, test cluster stability, set up security, and more, all on open-source software...

Administration. Things that make your life easier when building a cluster: Hadoop admins and Hadoop distributions. Hadoop distributions such as Hortonworks Data Platform (HDP) and Cloudera provide version compatibility, support, and additional software.

HDP (diagram). Source: hortonworks.com

What is AWS? Amazon Web Services is the leading cloud services provider. What is cloud? Renting servers and redundant data storage. AWS has a lot of services built on top of their cloud infrastructure.

AWS EC2. Elastic Compute Cloud (EC2) is a cloud service that lets you rent servers. Define the hardware details (for example, a t2.large instance gives you 2 vCPUs and 8 GB of memory), specify how much storage you need, choose the OS image (Red Hat, Ubuntu, Windows Server, etc.), and launch the instance. A minimal launch sketch follows.
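A minimal boto3 sketch of that launch flow; the AMI ID, key pair, and region are placeholders, and configured AWS credentials are assumed.

    # Sketch: launch a t2.large EC2 instance with boto3 (placeholders marked).
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder OS image, e.g. an Ubuntu AMI
        InstanceType="t2.large",           # 2 vCPUs, 8 GB of memory
        MinCount=1,
        MaxCount=1,
        KeyName="my-key-pair",             # placeholder key pair
        BlockDeviceMappings=[{             # how much storage you need
            "DeviceName": "/dev/sda1",
            "Ebs": {"VolumeSize": 100},    # GB
        }],
    )
    print(response["Instances"][0]["InstanceId"])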

AWS S3. Simple Storage Service (S3) is a redundant data storage service. Files are called objects in S3, and objects live in top-level containers called buckets. You are charged per GB-month of data stored in S3. A short usage sketch follows.
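A short boto3 sketch of putting and getting an object; the bucket and key names are placeholders, and configured AWS credentials are assumed.

    # Sketch: store a file as an S3 object and read it back.
    import boto3

    s3 = boto3.client("s3")

    # Upload a local file into a bucket under a key (the "object").
    s3.upload_file("events.csv", "my-example-bucket", "raw/events.csv")

    # Read the object back.
    obj = s3.get_object(Bucket="my-example-bucket", Key="raw/events.csv")
    print(obj["Body"].read()[:100])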

AWS Kinesis. Kinesis Streams: a distributed, fault-tolerant messaging queue, fit for small, high-frequency data. Kinesis Firehose: writes streaming data directly to S3 and other AWS storage services. Kinesis Analytics: run SQL on a Kinesis Stream. A producer sketch follows.
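A minimal producer sketch for Kinesis Streams with boto3; the stream name and payload are placeholders.

    # Sketch: write a single small, high-frequency record to a Kinesis stream.
    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    record = {"source_country": "US", "event_type_code": 2, "event_timestamp": 1484914741}

    kinesis.put_record(
        StreamName="example-event-stream",          # placeholder stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["source_country"],      # controls shard assignment
    )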

AWS Lambda. Run code in the cloud without worrying about servers. Define a function (Java, Node, Python, and now C#), define a trigger (a file put in S3, data sent to a Kinesis Stream, or the Lambda function called directly), and AWS will deploy your code and run it whenever the function is triggered. A minimal handler sketch follows.
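A minimal Python handler sketch for an S3-triggered Lambda function; the event shape follows the standard S3 notification format, and the function body is illustrative only.

    # Sketch: Lambda handler invoked whenever a file is put in the configured S3 bucket.
    def lambda_handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            print("New object: s3://{}/{}".format(bucket, key))
        return {"processed": len(event["Records"])}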

What do we mean by serverless? Any cloud service where the details and operations of the server are not exposed to the user of the service: Lambda, Kinesis, DynamoDB, S3, Athena. Not EC2.

Where would you use this? In the implementation example, administration is a large inhibitor of success and development speed. Even with Hadoop distributions and support, getting everything installed and configured correctly is a large effort. Server administration is still a big part of Hadoop administration: OS updates, OS-level security, space concerns (log files get out of hand). Support is costly, and the hours spent on administration are costly, as are capacity planning (especially if the cluster is on-site) and scaling based on load. Serverless potentially alleviates these issues.

Serverless Example Implementation, step 1: records are written to a Kinesis Firehose delivery stream; Firehose batches up the records and puts them in S3 (diagram: Relational Database -> Kinesis Firehose delivery stream -> S3 bucket). A producer sketch follows.
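A hedged producer sketch for that step with boto3; the delivery stream name and record layout are placeholders.

    # Sketch: write one record to a Firehose delivery stream; Firehose batches to S3.
    import json
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    event = {"event_id": 15234,
             "source_location": "40.741895,-73.989308",
             "event_xml": "<Header><Event>"}

    firehose.put_record(
        DeliveryStreamName="example-delivery-stream",      # placeholder stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )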

Serverless Example Implementation, step 2: a Lambda function triggers on the write to S3, flattens the records, and pushes them to a Kinesis stream (diagram: S3 bucket -> trigger -> Lambda Flatten -> flattened rows -> Kinesis Stream). A handler sketch follows.
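A hedged handler sketch for that step; the stream name, XML layout, and file format (one XML event per line) are assumptions.

    # Sketch: S3-triggered Lambda that flattens XML events and pushes rows to Kinesis.
    import json
    import xml.etree.ElementTree as ET
    import boto3

    s3 = boto3.client("s3")
    kinesis = boto3.client("kinesis")

    STREAM_NAME = "flattened-events"   # placeholder stream name

    def flatten(xml_string):
        event = ET.fromstring(xml_string)
        return {"source_country": event.findtext("source_country"),
                "event_type_code": int(event.findtext("event_type_code")),
                "event_timestamp": int(event.findtext("event_timestamp"))}

    def lambda_handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            for line in body.splitlines():   # one XML event per line (assumed)
                row = flatten(line)
                kinesis.put_record(
                    StreamName=STREAM_NAME,
                    Data=json.dumps(row).encode("utf-8"),
                    PartitionKey=row["source_country"],
                )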

Serverless Example Implementation, step 3: Kinesis Analytics is used to write a sample to S3 and to run the query (diagram: Kinesis Stream -> Kinesis Analytics, Sample and Query -> Kinesis Firehose delivery streams -> S3 buckets).

Serverless Example Implementation, step 4: Athena is used for ad-hoc query and analysis directly against the data in S3. A query sketch follows.
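A short boto3 sketch of an ad-hoc Athena query; the database, table, and S3 output location are placeholders, and Athena writes its results back to S3.

    # Sketch: submit an ad-hoc Athena query over data stored in S3.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        QueryString="SELECT source_country, COUNT(*) AS events "
                    "FROM events_flat GROUP BY source_country",
        QueryExecutionContext={"Database": "example_db"},        # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-example-bucket/athena-results/"},
    )
    print(response["QueryExecutionId"])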
