Data Lake Best Practices

Agenda: Why Data Lake; Key Components of a Data Lake; Modern Data Architecture; Some Best Practices; Case Study; Summary; Takeaways.

What is a Data Lake?

What is a data lake? It is an architecture that allows you to collect, store, process, analyze, and consume all data that flows into your organization. Why build a data lake? To leverage all data that flows into your organization: customer centricity, business agility, better predictions via machine learning, and competitive advantage.

Comparison of a Data Lake to an Enterprise Data Warehouse
Data lake: complementary to the EDW (not a replacement); schema on read (no predefined schemas); structured, semi-structured, and unstructured data; fast ingestion of new data/content; data science, prediction/advanced analytics, and BI use cases; data at a low level of detail/granularity; loosely defined SLAs; flexibility in tools (open source and tools for advanced analytics).
Enterprise DW: the data lake can be a source for the EDW; schema on write (predefined schemas); structured data only; time consuming to introduce new content; BI use cases only (no prediction/advanced analytics); data at a summary/aggregated level of detail; tight SLAs (production schedules); limited flexibility in tools (SQL only).

Key Concepts Associated with a Data Lake

[Diagram: many independent compute clusters sharing a single storage layer, illustrating compute decoupled from storage]

Components of a Data Lake: Storage & Streams. Data storage: high durability; stores raw data from input sources; support for any type of data; low cost. Streaming: streaming ingest of feed data; provides the ability to consume any dataset as a stream; facilitates low-latency analytics.

Components of a Data Lake: Catalogue & Search. Catalogue: a metadata lake used for summary statistics and data classification management. Search: a simplified access model for data discovery.

Components of a Data Lake: Entitlements. The entitlements system covers encryption, authentication, authorisation, chargeback, quotas, data masking, and regional restrictions.

Components of a Data Lake: API & User Interface. Exposes the data lake to customers; allows the catalogue to be queried programmatically; exposes a search API; ensures that entitlements are respected.

The Modern Data Architecture

[Diagram: data lake layers: API & UI, Entitlements, Catalogue & Search, Storage & Streams]

Why Is Amazon S3 the Fabric of the Data Lake? Natively supported by big data frameworks (Spark, Hive, Presto, etc.). Decouples storage and compute: no need to run compute clusters just for storage (unlike HDFS), transient Hadoop clusters and Amazon EC2 Spot Instances can be used, and multiple heterogeneous analysis clusters can work on the same data. Virtually unlimited number of objects and volume of data, with very high bandwidth and no aggregate throughput limit. Designed for 99.99% availability (tolerates the loss of an Availability Zone) and 99.999999999% durability, with no need to pay for data replication. Native support for versioning. Tiered storage (Standard, Standard-IA, Amazon Glacier) via lifecycle policies; keep very frequently accessed (hot) data on HDFS. Secure: SSL in transit, client-side and server-side encryption at rest. Low cost.
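
As a concrete illustration of the tiered-storage point above, here is a minimal boto3 sketch, assuming a hypothetical bucket name and prefix, that applies a lifecycle policy moving raw objects to Standard-IA after 30 days and to Amazon Glacier after 90 days:

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and prefix names, for illustration only.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-data-lake-raw",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-raw-data",
                    "Filter": {"Prefix": "raw/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm data
                        {"Days": 90, "StorageClass": "GLACIER"},      # cold data
                    ],
                }
            ]
        },
    )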

Indexing and Searching using Metadata: Amazon S3 ObjectCreated and ObjectDeleted events trigger an AWS Lambda function that updates (PutItem) a metadata index in DynamoDB; the DynamoDB update stream then triggers a second AWS Lambda function that extracts the search fields and updates a search index in Amazon Elasticsearch Service or Amazon CloudSearch.
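
A minimal sketch of the first Lambda function in that flow, assuming a hypothetical DynamoDB table named ObjectMetadata keyed on the object key; it records basic object metadata whenever an S3 ObjectCreated event arrives:

    import urllib.parse
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("ObjectMetadata")  # hypothetical catalogue table

    def handler(event, context):
        """Triggered by S3 ObjectCreated events; writes one metadata item per object."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            table.put_item(Item={
                "objectKey": key,  # assumed partition key of the catalogue table
                "bucket": bucket,
                "size": record["s3"]["object"].get("size", 0),
                "eventTime": record["eventTime"],
            })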

Identity & Access Management: manage users, groups, and roles; identity federation with OpenID; temporary credentials with the AWS Security Token Service (AWS STS); stored policy templates; a powerful policy language; Amazon S3 bucket policies.
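
One way the entitlements layer can hand out access is through short-lived STS credentials. A minimal sketch, assuming a hypothetical IAM role whose policy restricts reads to the S3 prefixes a given consumer is entitled to:

    import boto3

    sts = boto3.client("sts")

    # Hypothetical role ARN; the role's policy would scope access to specific prefixes.
    response = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/DataLakeReadOnly",
        RoleSessionName="analytics-user-42",
        DurationSeconds=3600,  # short-lived credentials
    )

    creds = response["Credentials"]
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    # The consumer can now read only what the assumed role's policy allows.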

Data Encryption. AWS CloudHSM: dedicated tenancy, SafeNet Luna SA HSM device, Common Criteria EAL4+, NIST FIPS 140-2. AWS Key Management Service: automated key rotation and auditing, integration with other AWS services, AWS server-side encryption, AWS-managed key infrastructure.
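
For server-side encryption at rest, objects can be written with SSE-KMS. A minimal sketch, assuming a hypothetical bucket name, object key, and KMS key alias:

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket name, object key, and KMS key alias.
    s3.put_object(
        Bucket="example-data-lake-raw",
        Key="raw/events/2017/03/01/events.json",
        Body=b'{"event": "example"}',
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/data-lake-key",
    )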

Data Lake API & UI: exposes the metadata API, search, and Amazon S3 storage services to customers. It can be based on TVM/STS temporary access for many services, plus a bespoke API for metadata. Drive all UI operations from the API?

Introducing Amazon API Gateway: host multiple versions and stages of APIs; create and distribute API keys to developers; leverage AWS SigV4 to authorize access to APIs; throttle and monitor requests to protect the backend; leverages AWS Lambda.
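
To make the bespoke metadata API concrete, here is a hedged sketch of a Lambda function that API Gateway could proxy to; it looks up a single item in the hypothetical ObjectMetadata catalogue table used earlier and returns it as JSON (Lambda proxy integration assumed):

    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("ObjectMetadata")  # same hypothetical catalogue table as above

    def handler(event, context):
        """API Gateway (Lambda proxy) handler for GET /objects?key=<objectKey>."""
        object_key = (event.get("queryStringParameters") or {}).get("key")
        if not object_key:
            return {"statusCode": 400, "body": json.dumps({"error": "missing 'key'"})}

        item = table.get_item(Key={"objectKey": object_key}).get("Item")
        if item is None:
            return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
        return {"statusCode": 200, "body": json.dumps(item, default=str)}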

Data Integration Partners: reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes. https://aws.amazon.com/big-data/partner-solutions/

Putting it all together

Building a Data Lake on AWS [numbered architecture diagram; the services named include Kinesis Firehose, the Athena query service, Batch, and Glue]
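
As an illustration of the ingestion edge of that architecture, a minimal sketch that pushes JSON records into a hypothetical Kinesis Firehose delivery stream configured to land data in the raw S3 bucket:

    import json
    import boto3

    firehose = boto3.client("firehose")

    def ingest(events):
        """Send a small batch of JSON events to a hypothetical delivery stream."""
        records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]
        firehose.put_record_batch(
            DeliveryStreamName="data-lake-ingest",  # hypothetical stream name
            Records=records,
        )

    ingest([{"user": "alice", "action": "login"}, {"user": "bob", "action": "purchase"}])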

Processing Data for Analytics on your data lake

Processing & Analytics. Real-time: Amazon Kinesis Streams & Firehose, Amazon Elasticsearch Service, Spark Streaming on EMR, AWS Lambda, Amazon Kinesis Analytics, Apache Flink on EMR, Apache Storm on EMR. Batch: Amazon EMR (Hadoop, Spark, Presto), the Amazon Redshift data warehouse, the Athena query service. AI & predictive: Amazon Lex (speech recognition), Amazon Rekognition, Amazon Polly (text to speech), Amazon Machine Learning (predictive analytics). Transactional & RDBMS: DynamoDB (NoSQL) and Aurora (relational database). Plus BI & data visualization tools.

Important considerations

Data Temperature (hot / warm / cold)
Volume: MB-GB / GB-TB / PB-EB
Item size: B-KB / KB-MB / KB-TB
Latency: ms / ms-sec / min-hrs
Durability: low-high / high / very high
Request rate: very high / high / low
Cost per GB: highest for hot ($$-$), lower for warm ($-), lowest for cold

Which Stream/Message Storage Should I Use?
Amazon DynamoDB Streams: AWS managed; guaranteed ordering; exactly-once delivery (deduping); 24-hour data retention; 3-AZ availability; no scale limit (throughput tracks table IOPS); parallel consumption; stream MapReduce supported; 400 KB row/object size; higher cost (table cost).
Amazon Kinesis Streams: AWS managed; guaranteed ordering; at-least-once delivery; 7-day data retention; 3-AZ availability; no scale limit (throughput tracks shards); parallel consumption; stream MapReduce supported; 1 MB row/object size; low cost.
Amazon Kinesis Firehose: AWS managed; no guaranteed ordering; at-least-once delivery; retention N/A; 3-AZ availability; no scale limit (automatic); no parallel consumption; row/object size set by the destination; low cost.
Apache Kafka: not AWS managed; guaranteed ordering; at-least-once delivery; configurable retention and availability; no scale limit (throughput tracks nodes); parallel consumption; stream MapReduce supported; configurable row/object size; low cost (plus administration).
Amazon SQS (Standard): AWS managed; no guaranteed ordering; at-least-once delivery; 14-day retention; 3-AZ availability; no scale limit (automatic); no parallel consumption; 256 KB message size; low-medium cost.
Amazon SQS (FIFO): AWS managed; guaranteed ordering; exactly-once delivery (deduping); 14-day retention; 3-AZ availability; 300 TPS per queue; no parallel consumption; 256 KB message size; low-medium cost.

Analytics Types & Frameworks
Batch: takes minutes to hours. Example: daily/weekly/monthly reports. Amazon EMR (MapReduce, Hive, Pig, Spark).
Interactive: takes seconds. Example: self-service dashboards. Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark). Sub-second: Amazon ElastiCache (Redis, Memcached), SAP HANA.
Message: takes milliseconds to seconds. Example: message processing. Amazon SQS applications on Amazon EC2.
Stream: takes milliseconds to seconds. Example: fraud alerts, 1-minute metrics. Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda.
Artificial intelligence: takes milliseconds to minutes. Example: fraud detection, demand forecasting, text to speech. Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe).
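
To ground the batch path, a minimal PySpark sketch of the kind of job Amazon EMR could run directly against S3, assuming hypothetical bucket names and a JSON event layout:

    from pyspark.sql import SparkSession

    # Runs on an EMR cluster, reading from and writing back to S3 (hypothetical paths).
    spark = SparkSession.builder.appName("daily-report").getOrCreate()

    events = spark.read.json("s3://example-data-lake-raw/raw/events/2017/03/01/")
    report = events.groupBy("action").count()
    report.write.mode("overwrite").parquet("s3://example-data-lake-curated/reports/2017-03-01/")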

Which Analysis Tool Should I Use?
Amazon Redshift: optimized for data warehousing; scales with the number of nodes; AWS managed service; local storage; optimized via columnar storage, data compression, and zone maps; metadata managed by Amazon Redshift; BI tool support via JDBC/ODBC; access control with users, groups, and access controls; scalar UDF support.
Amazon Athena: ad-hoc interactive queries; automatic scaling with no limits; AWS managed and serverless; data stays in Amazon S3; reads CSV, TSV, JSON, Parquet, ORC, and Apache web logs; metadata in the Athena Catalog Manager; BI tool support via JDBC; access control via AWS IAM; no UDF support.
Amazon EMR: Presto for interactive queries, Spark for general-purpose processing (iterative ML, real-time, and more), Hive for batch; scales with the number of nodes; AWS managed service; storage on Amazon S3 or HDFS; optimization is framework dependent; metadata in the Hive metastore; BI tool support via JDBC/ODBC and custom connectors; access control via LDAP integration; UDF support.
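
As a small illustration of serverless interactive queries against S3 data, a hedged boto3 sketch that runs an Athena query and waits for the result, assuming a hypothetical database, table, and results bucket:

    import time
    import boto3

    athena = boto3.client("athena")

    # Hypothetical database, table, and results location.
    query = "SELECT action, count(*) AS cnt FROM datalake.events GROUP BY action"
    qid = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "datalake"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query finishes (simplistic; real code would back off and time out).
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
        for row in rows[1:]:  # the first row is the header
            print([col.get("VarCharValue") for col in row["Data"]])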

Case Study

Case Study: Re-architecting Compliance. "For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: we can do things that we physically weren't able to do before, and that is priceless." - Steve Randich, CIO. What FINRA needed: infrastructure for its market surveillance platform; support for analysis and storage of approximately 75 billion market events every day. Why they chose AWS: fulfillment of FINRA's security requirements; the ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3. Benefits realized: increased agility, speed, and cost savings; estimated savings of $10-20 million annually by using AWS.

Fraud Detection: FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.

Summary

AWS enables you to build sophisticated data lakes and related analytics applications: retrospective, real-time, and predictive. You can build incrementally, adding use cases and increasing scale as you go. AWS provides a broad range of security and auditing features to enable you to meet your security requirements. https://aws.amazon.com/big-data/

Takeaways

Prescriptive guidance and rapidly deployable solutions to help you store, analyze, and process big data on the AWS Cloud:
Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
Deploying a Data Lake on AWS (March 2017, AWS Online Tech Talks)
Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS
Best Practices for Building a Data Lake with Amazon S3 (August 2016, Monthly Webinar Series, YouTube)
Links: http://amzn.to/2lpbc8p http://amzn.to/2qpifak http://bit.ly/2qipa8h http://amzn.to/2mzgppl http://bit.ly/2qielyx

?