Cloud Analytics and Business Intelligence on AWS
Enterprise Applications
  Virtual Desktops, Sharing & Collaboration
Platform Services
  Analytics: Hadoop, Real-time Streaming Data, Machine Learning, Data Warehouse, Data Pipeline
  App Services: Queuing & Notifications, Workflow, App streaming, Transcoding, Email, Search
  Deployment & Management: One-click web app deployment, Dev/ops resource management, Resource Templates, Code Deploy, Code Pipeline, Code Commit
  Mobile Services: Identity, Sync, Mobile Analytics, Push Notifications
Administration & Security
  Identity Management, Access Control, Usage & Resource Tracking, Service Catalog, Key Storage & Management, Monitoring and Logs
Core Services
  Compute (VMs, Auto-scaling and Load Balancing), Storage (Object, Block and Archival), CDN, Databases (Relational, NoSQL, Caching), Networking (VPC, DX, DNS)
Infrastructure
  Regions, Availability Zones, Points of Presence
Simple Storage Service (S3)
  Highly scalable object storage for the internet
  1 byte to 5 TB in size
  Availability: 99.99%
  Durability: 99.999999999%
  A Distributed Object Store
    Not a file system
    No Single Points of Failure
    Eventually consistent
  Paradigm: Object store
  Performance: Very Fast
  Redundancy: Across Availability Zones
  Security: Public Key / Private Key
  Pricing: $0.03/GB/month
  Typical use case: Write once, read many
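To put the durability and pricing figures above in concrete terms, here is a rough back-of-the-envelope sketch. It assumes the durability figure is an annual per-object rate (the slide does not state the period), that $0.03/GB/month is a flat rate with no tiers, and that 1 TB = 1,000 GB. The function names are illustrative only.

```python
# Back-of-the-envelope S3 durability and storage cost.
# Assumptions (not from the slide): durability is an annual per-object rate,
# pricing is flat with no tiers, and 1 TB = 1,000 GB.

ANNUAL_LOSS_RATE = 1e-11        # 1 - 99.999999999% durability
PRICE_PER_GB_MONTH = 0.03       # USD, from the slide


def expected_annual_losses(num_objects: int) -> float:
    """Expected number of objects lost per year at the stated durability."""
    return num_objects * ANNUAL_LOSS_RATE


def monthly_storage_cost(terabytes: float) -> float:
    """Monthly storage cost in USD for the given volume."""
    return terabytes * 1000 * PRICE_PER_GB_MONTH


if __name__ == "__main__":
    losses = expected_annual_losses(10_000_000)
    print(f"10M objects: ~{losses:.4f} expected losses per year")
    print(f"200 TB: ${monthly_storage_cost(200):,.0f} per month")
```

Under these assumptions, storing ten million objects corresponds to an expected loss of about one object every 10,000 years.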
S3 Performance & Scalability
  Amazon S3 provides near linear scalability
  S3 Streaming Performance: 34 secs per terabyte
    100 VMs: 9.6 GB/s, $26/hr
    350 VMs: 28.7 GB/s, $90/hr
  [Chart: throughput (GB/Second) vs. Reader Connections]
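The streaming figures above can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming 1 TB = 1,000 GB and that the quoted aggregate throughput is sustained for the whole read; the function names are made up for illustration.

```python
# Seconds and cost to stream 1 TB from S3 at the two cluster sizes on the slide.
# Assumptions: 1 TB = 1,000 GB and throughput is sustained for the whole read.

CLUSTERS = [
    # (number of VMs, aggregate GB/s, USD per hour), figures from the slide
    (100, 9.6, 26.0),
    (350, 28.7, 90.0),
]


def seconds_per_terabyte(gb_per_second: float) -> float:
    """Time to read 1 TB at the given aggregate throughput."""
    return 1000.0 / gb_per_second


def cost_per_terabyte(gb_per_second: float, usd_per_hour: float) -> float:
    """Compute cost of reading 1 TB, prorated from the hourly cluster price."""
    return usd_per_hour * seconds_per_terabyte(gb_per_second) / 3600.0


if __name__ == "__main__":
    for vms, gbps, usd in CLUSTERS:
        print(f"{vms} VMs: {seconds_per_terabyte(gbps):.1f} s/TB, "
              f"${cost_per_terabyte(gbps, usd):.2f}/TB, "
              f"{gbps / vms * 1000:.0f} MB/s per VM")
```

With these inputs the 350-VM configuration works out to roughly 35 seconds per terabyte, in line with the slide's headline figure, and per-VM throughput drops only modestly between the two cluster sizes, which is what "near linear" suggests.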
Application Services
  Deployment & Administration, App Services, Compute, Storage, Networking, Analytics, Database, AWS Global Infrastructure
  Amazon Kinesis: Managed Service for Real Time Big Data Processing
    Create Streams to Produce & Consume Data
    Elastically Add and Remove Shards for Performance
    Use Kinesis Client Library (KCL) to Process Data
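To illustrate why adding shards spreads load, here is a small sketch of how a record's partition key determines its shard. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash-key range contains that value. The sketch assumes the hash space is split into equal, contiguous ranges, which a real stream need not have after splits and merges; `hash_key` and `shard_for` are illustrative names, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumes the hash space is split into num_shards equal, contiguous ranges.
    """
    return hash_key(partition_key) * num_shards // HASH_SPACE


if __name__ == "__main__":
    # Hypothetical partition keys, e.g. one per data source.
    keys = [f"sensor-{i}" for i in range(10_000)]
    for n in (2, 4):
        counts = [0] * n
        for k in keys:
            counts[shard_for(k, n)] += 1
        print(f"{n} shards: {counts}")
```

Because the ranges are contiguous, doubling the shard count in this model splits each shard's keys between two children rather than reshuffling everything.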
Amazon Kinesis
  [Diagram: Data Sources -> AWS Endpoint -> stream of shards (Shard 1, Shard 2, ... Shard N) across three Availability Zones -> consuming applications]
  Consuming applications:
    App.1 [Aggregate & De-Duplicate]
    App.2 [Metric Extraction]
    App.3 [Sliding Window Analysis]
    App.4 [Machine Learning]
  Destinations shown: S3, DynamoDB, Redshift
AWS Security Services
  Cloud HSM
    Dedicated Tenancy SafeNet Luna SA HSM Device
    Common Criteria EAL4+, NIST FIPS 140-2
  AWS Key Management Service
    Implemented on HSM
    Automated Key Rotation & Auditing
    Integration with other AWS Services
  AWS Server Side Encryption
    AWS Managed Key Infrastructure
Structured Data Management
Database
  Relational Database Service (RDS): Managed Oracle, MySQL & SQL Server
  DynamoDB: Managed NoSQL Database
  ElastiCache: Managed In Memory Caching
  Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse
Relational Database Service
  Database-as-a-Service
  No need to install or manage database instances
  Scalable and fault tolerant configurations
DynamoDB
  Provisioned throughput NoSQL database
  Fast, predictable, configurable performance
  Fully distributed, fault tolerant HA architecture
  Integration with EMR & Hive
DynamoDB Consistency
  Writes
    Writes are acknowledged (committed) once they exist in at least two physical data centers
    Writes are persisted to SSD
  Reads
    Tunable for Application Requirements
    No reduction in durability or consistency in order to achieve throughput
    Eventually Consistent Read: stale values may be read; highest throughput
    Strongly Consistent Read: no stale values read; lower potential throughput
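The throughput difference between the two read modes can be made concrete with provisioned read capacity. The sketch below relies on DynamoDB's documented capacity rule, which is not stated on the slide: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads. The function name and example workload are hypothetical.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # one read capacity unit covers an item up to 4 KB


def read_capacity_units(item_bytes: int, reads_per_second: int,
                        strongly_consistent: bool) -> int:
    """Provisioned read capacity units needed for a steady read workload."""
    # Each read is rounded up to a whole number of 4 KB units.
    units_per_read = math.ceil(item_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        # Eventually consistent reads cost half as much.
        total = math.ceil(total / 2)
    return total


if __name__ == "__main__":
    for consistent in (True, False):
        mode = "strongly" if consistent else "eventually"
        units = read_capacity_units(3000, 100, consistent)
        print(f"100 reads/s of 3,000-byte items, {mode} consistent: {units} units")
```

Put differently, the same provisioned capacity serves about twice as many eventually consistent reads, which is the trade-off the slide describes.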
Redshift
  Managed Massively Parallel Petabyte Scale Data Warehouse
  Streaming Backup/Restore to S3
  Load data from S3, DynamoDB and EMR
  Extensive Security Features
  Scale from 160 GB -> 1.6 PB Online
Redshift Parallelizes Everything
  Query, Load, Backup, Restore, Resize
  [Diagram: Common BI Tools -> JDBC/ODBC -> Leader Node -> Compute Nodes, connected by a 10 GigE Mesh]
Exploratory Analytics, Data Cleansing, Advanced Data Science
Managed Big Data
  Elastic MapReduce
    Managed, elastic Hadoop (1.x & 2.x) cluster
    Integrates with S3, DynamoDB and Redshift
    Install End User Tools Automatically (Spark, Impala)
    Support for EC2 Spot Instances
Vibrant Ecosystem: Pig, HDFS, EMR
Weather Insurance for Farms
  Challenge: Volatile weather is deadly to crops like grapes
  Solution: Built a predictive model based on freely available data:
    150B Soil Observations
    60 years of crop data
    200 TB of S3 Data
    1M government Doppler radar points
    850K Precision Rainfall Grids Tracked
    3M Daily Weather Measurements
  50 EMR clusters process new data as it comes into S3 each day, continuously updating the model
Choose your instance types
  General: m3 family
  CPU: c3 family, cc2.8xlarge, d2 family
  Memory: m2 family, r3 family
  Disk/IO: d2 family, i2 family
  Workloads: ETL, Machine Learning, Spark, HDFS
  Try different configurations to find the optimal cost/performance balance
New EC2 Instances: C4
  Custom Intel Xeon processors for AWS
  C4 = highest performing EC2 instances
The Financial Industry Regulatory Authority
  30 Billion Market Events / Day
  Objective: react to changing Market Dynamics
  Amazon Elastic MapReduce & Amazon S3
  $10-20M Savings by moving Platform to AWS
Event Processing
  AWS Lambda: Fully Managed Event Processor
    Node.js, Integrated AWS SDK & ImageMagick
    Natively Compile & Install Node.js modules
    Specify Runtime RAM & Timeout
    Automatically Scaled to support Event Volume
    Events from S3, DynamoDB, Kinesis & Lambda
    Integrated CloudWatch Logging
Introducing Amazon Machine Learning
  SDE expertise
    Easily create machine learning models
    Visualize and optimize models
    Put models into production in seconds
  Machine Learning expertise
    Battle-hardened technology
Easy to Use, High Performance
  Train and optimize models on GBs of data
  Batch process predictions
  Real-time prediction API in one click
  No servers to provision or manage
Developing with Amazon Machine Learning
  1. Build model
  2. Validate & optimize
  3. Make predictions
Building a Predictive Model with Amazon Machine Learning
  Use existing data in S3, Redshift and RDS
  Automatic data visualization & exploration
    Descriptive and summary statistics
  Your data doesn't have to be perfect
    Missing data, malformed data records, type validation
Model Validation and Optimization Tools
Making Predictions with Amazon Machine Learning
  Batch predictions
    Asynchronous predictions with trained model
  Real time predictions
    Synchronous, low latency, high throughput
    Mount API end-point with a single click
Traditional Business Intelligence, OLAP, Data Sources for ML
Managed Data Warehouse
  Redshift
    Managed Massively Parallel Petabyte Scale Data Warehouse
    Streaming Backup/Restore to S3
    Load data from S3, DynamoDB and EMR
    Extensive Security Features
    Scale from 160 GB -> 1.6 PB Online
Redshift lets you start small and grow big
  Extra Large Node (dw1.xl & dw2.xl)
    3 spindles, 15 GiB RAM, 2 virtual cores, 10GigE
    Single Node (160 GB SSD or 2 TB Magnetic)
    Cluster 2-32 Nodes (320 GB SSD to 64 TB Magnetic)
  8 Extra Large Node (dw1.8xl & dw2.8xl)
    24 spindles, 120 GiB RAM, 1.2 TB SSD or 16 TB Magnetic, 16 virtual cores, 10GigE
    Cluster 2-100 Nodes (2.4 TB SSD to 1.6 PB Magnetic)
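The cluster capacity ranges quoted above follow from multiplying per-node storage by node count. A minimal sketch using only the slide's figures, assuming decimal units (1 TB = 1,000 GB); the table layout and function name are illustrative and do not reflect any Redshift API.

```python
# Per-node storage in GB as listed on the slide (1 TB = 1,000 GB assumed).
PER_NODE_GB = {
    ("xl", "ssd"): 160,
    ("xl", "magnetic"): 2_000,
    ("8xl", "ssd"): 1_200,
    ("8xl", "magnetic"): 16_000,
}

# Allowed cluster sizes (min nodes, max nodes) per node size, from the slide.
NODE_LIMITS = {"xl": (2, 32), "8xl": (2, 100)}


def cluster_capacity_gb(size: str, media: str, nodes: int) -> int:
    """Total raw storage of a multi-node cluster, in GB."""
    lo, hi = NODE_LIMITS[size]
    if not lo <= nodes <= hi:
        raise ValueError(f"{size} clusters support {lo}-{hi} nodes, got {nodes}")
    return PER_NODE_GB[(size, media)] * nodes


if __name__ == "__main__":
    print("smallest xl SSD cluster:", cluster_capacity_gb("xl", "ssd", 2), "GB")
    print("largest xl magnetic cluster:", cluster_capacity_gb("xl", "magnetic", 32), "GB")
    print("smallest 8xl SSD cluster:", cluster_capacity_gb("8xl", "ssd", 2), "GB")
    print("largest 8xl magnetic cluster:", cluster_capacity_gb("8xl", "magnetic", 100), "GB")
```

The endpoints reproduce the slide's ranges: 320 GB to 64 TB for extra large nodes and 2.4 TB to 1.6 PB for 8 extra large nodes.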
End User Reporting: EMR, Redshift, S3, DynamoDB
Ignite Your Ambition
  26 Markets, 3 Clearing Houses, 5 Central Securities Depositories
  Leading Index Provider with 41,000+ Indexes Across Asset Classes and Geographies
  Over 10,000 Corporate Clients in 60 countries
  Lists more than 3,500 companies in 35 countries, representing more than $8.8 trillion in total market value
  100+ data product offerings supporting 2.5+ million investment professionals and users in 98 countries
  Our technology powers over 70 marketplaces, regulators, CSDs and clearinghouses in over 50 countries
NDW 1.0 Requirements
  Original scope was to replace on-premises warehouse with Redshift, keeping equivalent schemas and data
    4-8 Billion Rows/Day
    Legacy limited to 1 Year Retention
  Must be lower cost than legacy system
    Legacy $1.16M/Year
  Must satisfy multiple security and regulatory requirements
  Must perform similarly to legacy warehouse under concurrent query load
Migration Completed On Schedule
  Migrated off legacy warehouse to Redshift (start to finish) in 7 man-months
  Redshift costs were 43% of legacy budget for the same data set (~1100 tables)
  Tuned queries now running faster than on legacy system
  Data Ingest
    5.5B rows/day average for 2014
    High water mark: 14B rows in 1 day
    Best write rates ~2.76M rows/second
    450 GB/day (after compression) into Redshift
    1,895 GB/day average uncompressed
  Currently resize clusters once a quarter (if necessary)
    NDW_Prod is currently growing +3 dw1.8xl nodes per quarter
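A few derived figures help put the migration numbers above in perspective. This is a rough sketch, assuming the "legacy budget" is the $1.16M/year quoted in the requirements and that daily ingest is spread evenly over 24 hours; the variable and function names are illustrative.

```python
# Derived figures from the migration numbers on the slides.
# Assumptions: the legacy budget is the $1.16M/year from the requirements,
# and daily ingest is spread evenly over 24 hours.

LEGACY_COST_PER_YEAR = 1_160_000     # USD
REDSHIFT_SHARE = 0.43                # Redshift cost as a share of legacy budget
AVG_ROWS_PER_DAY = 5.5e9
PEAK_ROWS_PER_DAY = 14e9
BEST_ROWS_PER_SECOND = 2.76e6
COMPRESSED_GB_PER_DAY = 450
UNCOMPRESSED_GB_PER_DAY = 1_895
SECONDS_PER_DAY = 86_400


def derived_figures() -> dict:
    """Compute implied cost, average ingest rate, compression, and peak load time."""
    return {
        "redshift_cost_per_year": LEGACY_COST_PER_YEAR * REDSHIFT_SHARE,
        "avg_rows_per_second": AVG_ROWS_PER_DAY / SECONDS_PER_DAY,
        "compression_ratio": UNCOMPRESSED_GB_PER_DAY / COMPRESSED_GB_PER_DAY,
        "peak_day_hours_at_best_rate":
            PEAK_ROWS_PER_DAY / BEST_ROWS_PER_SECOND / 3600,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the Redshift cost comes to just under $500K per year, the average ingest rate is about 64K rows/second, compression is a little over 4:1, and the record 14B-row day would take under an hour and a half to load at the best observed write rate.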
Integrated Analytics