Data Lake Best Practices
Agenda
- Why Data Lake
- Key Components of a Data Lake
- Modern Data Architecture
- Some Best Practices
- Case Study
- Summary
- Takeaways
What is a Data Lake?
What is a data lake? It is an architecture that allows you to collect, store, process, analyze, and consume all data that flows into your organization.

Why a data lake?
- Leverage all data that flows into your organization
- Customer centricity
- Business agility
- Better predictions via machine learning
- Competitive advantage
Comparison of a Data Lake to an Enterprise Data Warehouse

Data lake (complementary to the EDW, not a replacement):
- Schema on read (no predefined schemas)
- Structured, semi-structured, and unstructured data
- Fast ingestion of new data/content
- Data science, prediction/advanced analytics, and BI use cases
- Data at a low level of detail/granularity
- Loosely defined SLAs
- Flexibility in tools (open source and tools for advanced analytics)

Enterprise data warehouse (the data lake can be a source for the EDW):
- Schema on write (predefined schemas)
- Structured data only
- Time consuming to introduce new content
- BI use cases only (no prediction/advanced analytics)
- Data at a summary/aggregated level of detail
- Tight SLAs (production schedules)
- Limited flexibility in tools (SQL only)
Key Concepts Associated with a Data Lake
[Diagram: many independent compute clusters sharing a single storage layer - compute and storage are decoupled.]
Components of a Data Lake: Storage & Streams

Data storage:
- High durability
- Stores raw data from input sources
- Support for any type of data
- Low cost

Streaming:
- Streaming ingest of feed data
- Provides the ability to consume any dataset as a stream
- Facilitates low-latency analytics
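A minimal sketch (Python with boto3) of this layer under stated assumptions: land each raw record durably in S3 and mirror it onto a Kinesis stream for low-latency consumers. The bucket and stream names are hypothetical and must already exist.

```python
import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

def ingest(record: dict, dataset: str, key: str):
    """Land a raw record in the data lake and expose it as a stream."""
    payload = json.dumps(record).encode("utf-8")

    # Durable, low-cost storage of the raw record (bucket name is hypothetical).
    s3.put_object(Bucket="my-data-lake-raw", Key=f"{dataset}/{key}.json", Body=payload)

    # Streaming copy for low-latency analytics (stream name is hypothetical).
    kinesis.put_record(StreamName="my-data-lake-feed", Data=payload, PartitionKey=dataset)
```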
Components of a Data Lake: Catalogue & Search

Catalogue:
- Metadata lake
- Used for summary statistics and data classification management

Search:
- Simplified access model for data discovery
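As one possible shape for the simplified access model, here is a hedged sketch of a catalogue lookup backed by DynamoDB. The table name, index name, and attribute names are assumptions for illustration only.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
catalogue = dynamodb.Table("data-lake-catalogue")  # hypothetical table name

def find_datasets(classification: str):
    """Return catalogue entries for a given classification.

    Assumes a global secondary index named 'by-classification' keyed on the
    'classification' attribute.
    """
    response = catalogue.query(
        IndexName="by-classification",
        KeyConditionExpression=Key("classification").eq(classification),
    )
    return response["Items"]
```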
Components of a Data Lake: Entitlements

Entitlements system:
- Encryption
- Authentication
- Authorisation
- Chargeback
- Quotas
- Data masking
- Regional restrictions
Components of a Data Lake: API & UI

API & user interface:
- Exposes the data lake to customers
- Programmatically query the catalogue
- Expose a search API
- Ensures that entitlements are respected
The Modern Data Architecture
Why Is Amazon S3 the Fabric of the Data Lake?
- Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
- Decouples storage and compute:
  - No need to run compute clusters just for storage (unlike HDFS)
  - Can run transient Hadoop clusters and Amazon EC2 Spot Instances
  - Multiple, heterogeneous analysis clusters can use the same data
- Virtually unlimited number of objects and volume of data
- Very high bandwidth; no aggregate throughput limit
- Designed for 99.99% availability; can tolerate an Availability Zone failure
- Designed for 99.999999999% durability; no need to pay for data replication
- Native support for versioning
- Tiered storage (Standard, Standard-IA, Amazon Glacier) via lifecycle policies; use HDFS for very frequently accessed (hot) data
- Secure: SSL, client-side and server-side encryption at rest
- Low cost
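A minimal sketch of the tiered-storage point: a lifecycle configuration that moves objects from Standard to Standard-IA and then to Glacier as they cool. The bucket name, prefix, and transition days are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Tier objects down as they cool: Standard -> Standard-IA -> Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-raw",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```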
Indexing and Searching Using Metadata
- Amazon S3 emits ObjectCreated/ObjectDeleted notifications to AWS Lambda.
- Lambda extracts search fields and updates the metadata index in DynamoDB (PutItem).
- The DynamoDB update stream triggers a second AWS Lambda function that updates the search index (Amazon Elasticsearch Service or Amazon CloudSearch).
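A hedged sketch of the first hop in this pipeline: a Lambda handler that receives S3 event notifications and keeps the DynamoDB metadata index in sync. The table name and item attributes are assumptions; the second Lambda, which consumes the DynamoDB stream and updates the search index, is omitted.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
index = dynamodb.Table("metadata-index")  # hypothetical table name

def handler(event, context):
    """Triggered by S3 ObjectCreated/ObjectRemoved notifications."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if record["eventName"].startswith("ObjectCreated"):
            # Extract search fields and upsert the catalogue entry.
            index.put_item(Item={
                "bucket": bucket,
                "key": key,
                "size": record["s3"]["object"].get("size", 0),
                "event_time": record["eventTime"],
            })
        else:
            # ObjectRemoved events drop the entry from the index.
            index.delete_item(Key={"bucket": bucket, "key": key})
```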
Identity & Access Management
- Manage users, groups, and roles
- Identity federation with OpenID
- Temporary credentials with AWS Security Token Service (STS)
- Stored policy templates
- Powerful policy language
- Amazon S3 bucket policies
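A minimal sketch of the temporary-credentials point: exchange the caller's identity for short-lived, scoped credentials with STS and use them for S3 access. The role ARN and session name are hypothetical.

```python
import boto3

sts = boto3.client("sts")

# Obtain short-lived credentials scoped to a read-only data lake role
# (role ARN and session name are hypothetical).
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/data-lake-read-only",
    RoleSessionName="analyst-session",
    DurationSeconds=3600,
)

creds = assumed["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```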
Data Encryption
- AWS CloudHSM: dedicated tenancy, SafeNet Luna SA HSM device, Common Criteria EAL4+, NIST FIPS 140-2
- AWS Key Management Service: automated key rotation and auditing, integration with other AWS services
- AWS server-side encryption: AWS-managed key infrastructure
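A minimal sketch of server-side encryption with a KMS key when writing to the lake. The bucket, object key, and KMS key alias are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Write an object encrypted at rest with an AWS KMS key
# (bucket, key, and key alias are hypothetical).
with open("orders.parquet", "rb") as data:
    s3.put_object(
        Bucket="my-data-lake-raw",
        Key="curated/orders/2017/03/orders.parquet",
        Body=data,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/data-lake-key",
    )
```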
Data Lake API & UI
- Exposes the metadata API, search, and Amazon S3 storage services to customers
- Can be based on TVM/STS temporary access for many services, plus a bespoke API for metadata
- Drive all UI operations from the API?
Introducing Amazon API Gateway
- Host multiple versions and stages of APIs
- Create and distribute API keys to developers
- Leverage AWS SigV4 to authorize access to APIs
- Throttle and monitor requests to protect the backend
- Leverages AWS Lambda
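A hedged sketch of the SigV4 point: signing a request to an API Gateway endpoint with the caller's AWS credentials using botocore's signer. The endpoint URL and region are hypothetical.

```python
import urllib.request
import botocore.session
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# Hypothetical API Gateway endpoint for the data lake catalogue API.
url = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/datasets"

# Sign the request with SigV4 using the caller's AWS credentials.
credentials = botocore.session.Session().get_credentials()
request = AWSRequest(method="GET", url=url)
SigV4Auth(credentials, "execute-api", "us-east-1").add_auth(request)

with urllib.request.urlopen(
    urllib.request.Request(url, headers=dict(request.headers))
) as response:
    print(response.read().decode("utf-8"))
```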
Data Integration Partners
Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.
https://aws.amazon.com/big-data/partner-solutions/
Putting it all together
Building a Data Lake on AWS
[Architecture diagram: numbered data flow through Amazon Kinesis Firehose ingestion, AWS Glue, batch processing, and the Athena query service.]
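A minimal sketch of the ingestion step in this architecture: pushing a record into a Kinesis Firehose delivery stream that delivers to S3. The delivery stream name and record contents are hypothetical; the stream must already be configured with an S3 destination.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Push a record into a Firehose delivery stream that lands in S3
# (delivery stream name is hypothetical).
firehose.put_record(
    DeliveryStreamName="data-lake-ingest",
    Record={"Data": (json.dumps({"event": "page_view", "user_id": "1234"}) + "\n").encode("utf-8")},
)
```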
Processing Data for Analytics on Your Data Lake
Processing & Analytics
- Real-time: Amazon Kinesis Streams & Firehose, Amazon Elasticsearch Service, Spark Streaming on EMR, AWS Lambda, Amazon Kinesis Analytics, Apache Flink on EMR, Apache Storm on EMR
- Batch: Amazon EMR (Hadoop, Spark, Presto), Amazon Redshift data warehouse, Athena query service
- AI & predictive: Amazon Lex (speech recognition), Amazon Rekognition (image recognition), Amazon Polly (text to speech), Amazon Machine Learning (predictive analytics)
- Transactional & RDBMS: DynamoDB (NoSQL database), Amazon Aurora (relational database)
- BI & data visualization
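A hedged sketch of the batch path: a PySpark job on EMR that reads raw JSON from S3, aggregates it, and writes curated Parquet back to the lake. All paths, column names, and the aggregation itself are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate-orders").getOrCreate()

# Read raw JSON landed by the ingest layer (path is hypothetical).
raw = spark.read.json("s3://my-data-lake-raw/orders/")

# Simple curation step: daily order totals, written back as Parquet.
daily = (
    raw.groupBy(F.to_date("order_timestamp").alias("order_date"))
       .agg(F.sum("amount").alias("total_amount"))
)
daily.write.mode("overwrite").parquet("s3://my-data-lake-curated/daily_order_totals/")
```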
Important considerations
Data Temperature
- Hot data: volume MB-GB; item size B-KB; latency in milliseconds; durability low to high; request rate very high; cost/GB highest ($$-$)
- Warm data: volume GB-TB; item size KB-MB; latency in milliseconds to seconds; durability high; request rate high; cost/GB lower ($-)
- Cold data: volume PB-EB; item size KB-TB; latency in minutes to hours; durability very high; request rate low; cost/GB lowest
Which Stream/Message Storage Should I Use?
- Amazon DynamoDB Streams: AWS managed; guaranteed ordering; exactly-once delivery (deduping); 24-hour retention; 3 AZ; scale ~ table IOPS (no limit); parallel consumption; stream MapReduce; 400 KB row/object size; higher cost (table cost)
- Amazon Kinesis Streams: AWS managed; guaranteed ordering; at-least-once delivery; 7-day retention; 3 AZ; scale ~ shards (no limit); parallel consumption; stream MapReduce; 1 MB row/object size; low cost
- Amazon Kinesis Firehose: AWS managed; no ordering guarantee; at-least-once delivery; retention N/A; 3 AZ; automatic scaling (no limit); no parallel consumption; destination row/object size; low cost
- Apache Kafka: not AWS managed; guaranteed ordering; at-least-once delivery; configurable retention; configurable availability; scale ~ nodes (no limit); parallel consumption; stream MapReduce; configurable row/object size; low cost (plus administration)
- Amazon SQS (Standard): AWS managed; no ordering guarantee; at-least-once delivery; 14-day retention; 3 AZ; automatic scaling (no limits); no parallel consumption; 256 KB row/object size; low-medium cost
- Amazon SQS (FIFO, new): AWS managed; guaranteed ordering; exactly-once delivery (deduping); 14-day retention; 3 AZ; 300 TPS per queue; no parallel consumption; 256 KB row/object size; low-medium cost
Analytics Types & Frameworks
- Batch: takes minutes to hours; example: daily/weekly/monthly reports; Amazon EMR (MapReduce, Hive, Pig, Spark)
- Interactive: takes seconds; example: self-service dashboards; Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark); sub-second: Amazon ElastiCache (Redis 3.2 TiB, Memcached), SAP HANA
- Message: takes milliseconds to seconds; example: message processing; Amazon SQS applications on Amazon EC2
- Stream: takes milliseconds to seconds; example: fraud alerts, 1-minute metrics; Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
- Artificial intelligence: takes milliseconds to minutes; example: fraud detection, demand forecasting, text to speech; Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)
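A minimal sketch of the stream path using AWS Lambda as the consumer: a handler decoding base64-encoded Kinesis records and flagging suspicious transactions. The threshold, field names, and alerting action are illustrative assumptions, not a production fraud detector.

```python
import base64
import json

ALERT_THRESHOLD = 10_000  # hypothetical per-transaction amount

def handler(event, context):
    """Invoked by a Kinesis event source mapping; records arrive base64-encoded."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("amount", 0) > ALERT_THRESHOLD:
            # In a real pipeline this would publish to SNS, a queue, or a dashboard.
            print(f"ALERT: suspicious transaction {payload.get('transaction_id')}")
```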
Which Analysis Tool Should I Use?
- Amazon Redshift: optimized for data warehousing; scale/throughput ~ nodes; AWS managed service; local storage; optimized via columnar storage, data compression, and zone maps; metadata managed by Amazon Redshift; BI tools supported (JDBC/ODBC); access controls via users and groups; UDF support (scalar)
- Amazon Athena: ad-hoc interactive queries; automatic scale, no limits; AWS managed, serverless; storage on Amazon S3; supports CSV, TSV, JSON, Parquet, ORC, and Apache web logs; metadata in the Athena Catalog Manager; BI tools supported (JDBC); access controls via AWS IAM; no UDF support
- Amazon EMR: Presto for interactive query, Spark for general purpose (iterative ML, real-time, ..), Hive for batch; scale/throughput ~ nodes; AWS managed service; storage on Amazon S3 or HDFS; optimization is framework dependent; metadata in the Hive metastore; BI tools supported (JDBC/ODBC and custom); integration with LDAP for access control; UDF support
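A minimal sketch of the Athena path: running an ad-hoc query against data in S3 via the Athena API. The database, table, and result location are hypothetical and assume the data has already been catalogued.

```python
import time
import boto3

athena = boto3.client("athena")

# Run an ad-hoc query (database, table, and output location are hypothetical).
query = athena.start_query_execution(
    QueryString="SELECT order_date, total_amount FROM daily_order_totals LIMIT 10",
    QueryExecutionContext={"Database": "curated"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-query-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes (simplified; no timeout or error handling).
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if status == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows)
```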
Case Study
Case Study: Re-architecting Compliance
"For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren't able to do before, and that is priceless." - Steve Randich, CIO

What FINRA needed:
- Infrastructure for its market surveillance platform
- Support for analysis and storage of approximately 75 billion market events every day

Why they chose AWS:
- Fulfillment of FINRA's security requirements
- Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3

Benefits realized:
- Increased agility, speed, and cost savings
- Estimated savings of $10-20 million annually by using AWS
Fraud Detection: FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.
Summary
- AWS enables you to build sophisticated data lakes and related analytics applications: retrospective, real-time, and predictive
- You can build incrementally, adding use cases and increasing scale as you go
- AWS provides a broad range of security and auditing features to enable you to meet your security requirements
https://aws.amazon.com/big-data/
Takeaways
Prescriptive guidance and rapidly deployable solutions to help you store, analyze, and process big data on the AWS Cloud:
- Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
- Deploying a Data Lake on AWS - March 2017 AWS Online Tech Talks
- Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS
- Best Practices for Building a Data Lake with Amazon S3 - August 2016
- Monthly Webinar Series - YouTube
Links: http://amzn.to/2lpbc8p, http://amzn.to/2qpifak, http://bit.ly/2qipa8h, http://amzn.to/2mzgppl, http://bit.ly/2qielyx
Questions?