MAKING BIG DATA COME ALIVE
AWS Serverless Architecture
Think Big
Garrett Holbrook, Data Engineer
Feb 1st, 2017
Agenda
- What is Think Big?
- Example Project Walkthrough
- AWS Serverless
Think Big, a Teradata Company
- Big Data Consulting: Roadmaps, Training, Strategy & Architecture, Implementation
- Acquired by Teradata in 2014
- Open source
About Us
- Garrett Holbrook: Graduated Neumont University with a BS in CS; with Think Big ~1 year
- Mike Forsyth: Graduated BYU with a BS in Computer Engineering; with Think Big since May 2016
- Max Goff: Think Big Academy
Example Implementation Walkthrough
- Company has a lot of data stored in an RDBMS
- The RDBMS is costly to manage and underperforms on certain queries
- Hoping Hadoop can provide reduced costs and better performance
- Hired Think Big to help
Next Step
- Evaluate use cases and prioritize
- Install MapReduce and HDFS on their servers
- Write some MapReduce jobs
- Done?
Not that simple, unfortunately/fortunately
- The Hadoop ecosystem has a mind-boggling number of technologies
Hadoop Ecosystem
2017 Think Big, a Teradata Company, 2/1/17
Not that simple, unfortunately/fortunately
- The Hadoop ecosystem has a mind-boggling number of technologies
- Each of these technologies fulfills some business or technical need
- MapReduce is only the tip of the iceberg
Sqoop
- Built for efficiently transferring bulk data between HDFS and relational databases
Source: blogs.apache.org/sqoop
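As a sketch of what driving Sqoop looks like, the snippet below assembles a typical `sqoop import` command. The JDBC URL, credentials, table name, and HDFS target directory are all hypothetical placeholders, not values from this project:

```python
# Hypothetical Sqoop import invocation, assembled as an argv list.
# The JDBC URL, username, table, and target directory are placeholders.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/events_db",  # assumed source DB
    "--username", "etl_user",
    "--table", "events",                 # table with event_id, source_location, event_xml
    "--target-dir", "/data/raw/events",  # HDFS landing directory
    "--as-textfile",
    "--num-mappers", "4",                # parallel copy tasks
]

# On a real cluster edge node you would execute it, e.g.:
#   subprocess.run(sqoop_import, check=True)
print(" ".join(sqoop_import))
```

Sqoop splits the copy across the mapper tasks (here 4), which is what makes the bulk transfer efficient.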
Spark
- Engine for general distributed big data processing
- Accomplishes the same goal as MapReduce, but does it better
- The Spark API provides functions in addition to map and reduce
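Spark's core API generalizes the same map/reduce idea. As a purely local sketch (no Spark required), here is the classic word count in plain Python; in Spark the same shape would use flatMap, map, and reduceByKey over a distributed dataset:

```python
from functools import reduce
from collections import Counter

lines = ["big data", "big clusters", "data pipelines"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word (Counter does the per-key grouping here)
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}), pairs, Counter())

print(dict(counts))  # {'big': 2, 'data': 2, 'clusters': 1, 'pipelines': 1}
```

The point of Spark is that each of these stages runs in parallel across the cluster, with the framework handling the shuffle between map and reduce.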
Hive + Tez
- Hadoop's data warehouse
- SQL is the language of Hive; it turns SQL queries into MapReduce jobs
- Newer versions (including the stable release) use Tez for better performance
- SQL skills carry over
- It is NOT a relational database, despite the use of SQL
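Because Hive speaks SQL, existing skills transfer directly. As an illustrative sketch, the HiveQL below (held in a Python string) defines an ORC-backed table and runs an aggregate over it; the table and column names are assumed to match the flattened event data used later in this walkthrough:

```python
# Illustrative HiveQL; the table and column names are assumptions
# mirroring the flattened event data in this walkthrough.
hive_query = """
CREATE TABLE IF NOT EXISTS events_flat (
    source_country  STRING,
    event_type_code INT,
    event_timestamp BIGINT
) STORED AS ORC;

SELECT source_country, COUNT(*) AS event_count
FROM events_flat
GROUP BY source_country;
"""
print(hive_query)
```

Hive compiles the SELECT into distributed jobs (MapReduce, or Tez on newer versions); the query text itself is ordinary SQL.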
Hue
- Web interface for data analysis on Hadoop
- SQL editor for use with Hive, Phoenix, etc.
- Spark notebooks
- Can be used as the main tool for users to gain access to a Hadoop cluster
Hue
Source: gethue.com
Example Implementation
1. Sqoop to import data from the RDBMS into Hadoop
[diagram: Relational Database → CSV in HDFS]

event_id | source_location      | event_xml
15234    | 40.741895,-73.989308 | <Header><Event>
15235    | 35.689487,139.691706 | <Header><Event>
Example Implementation
2. Spark to flatten the XML data
[diagram: CSV → Flatten (Spark) → CSV]

source_country | event_type_code | event_timestamp
JP             | 5               | 1484911141
US             | 2               | 1484914741
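A local sketch of what the flatten step does: parse each event's XML and derive the flat columns. The XML shape and the coordinate-to-country lookup below are hypothetical (the slides don't show the real schema), and the real job would run in Spark over the full dataset:

```python
import xml.etree.ElementTree as ET

# Hypothetical event XML shape; the real schema is not shown in the slides.
raw = '<Header><Event type_code="5" timestamp="1484911141"/></Header>'

def country_for(lat, lon):
    # Toy lookup for illustration; a real job would use a geo
    # library or reference table to map coordinates to a country.
    return "JP" if lon > 100 else "US"

def flatten(event_id, source_location, event_xml):
    lat, lon = (float(x) for x in source_location.split(","))
    event = ET.fromstring(event_xml).find("Event")
    return {
        "source_country": country_for(lat, lon),
        "event_type_code": int(event.get("type_code")),
        "event_timestamp": int(event.get("timestamp")),
    }

row = flatten(15234, "35.689487,139.691706", raw)
print(row)  # {'source_country': 'JP', 'event_type_code': 5, 'event_timestamp': 1484911141}
```

In Spark this per-row function would simply be mapped over the imported dataset in parallel.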
Example Implementation
3a. Hive to write a sample to a sample table
[diagram: CSV → Sample (Hive) → ORC: sample_table]
Example Implementation
3b. Hive to run the SQL query using a distributed and scalable processing engine
[diagram: CSV → Query (Hive) → ORC: query_result]
Example Implementation
4. Hue for visualization and analysis
[diagram: query_result, sample_table → Hue → User, Analyst, etc.]
Administration
Have a plan, engineers are ready, now what? Build out the cluster:
- Provision hardware: on-site or cloud?
- Install Hadoop, Spark, Sqoop, Hue, Hive, and in reality many more
- Test cluster stability
- Set up security
- And more, all on open-source software...
Administration
Things that make your life easier when building a cluster:
- Hadoop admins
- Hadoop distributions
Hadoop distributions (Hortonworks Data Platform (HDP), Cloudera) provide:
- Version compatibility
- Support
- Additional software
HDP
Source: hortonworks.com
What is AWS?
- Amazon Web Services is the leading cloud services provider
- What is cloud? Renting servers; redundant data storage
- AWS has a lot of services built on top of their cloud infrastructure
AWS EC2
- Elastic Compute Cloud (EC2) is a cloud service that lets you rent servers
- Define hardware details: for example, a t2.large instance gives you 2 vCPUs and 8 GB of memory
- Specify how much storage you need
- Define the OS image: Red Hat, Ubuntu, Windows Server, etc.
- Launch the instance
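Choosing an instance type is just picking a row from AWS's published spec table. The t2.large figures below come from the slide; the other entries are from memory and should be verified against AWS's current EC2 documentation:

```python
# vCPU / memory (GiB) per instance type. t2.large matches the slide;
# the other rows are assumptions to verify against current AWS docs.
INSTANCE_SPECS = {
    "t2.micro":  {"vcpus": 1, "memory_gib": 1},
    "t2.medium": {"vcpus": 2, "memory_gib": 4},
    "t2.large":  {"vcpus": 2, "memory_gib": 8},
}

spec = INSTANCE_SPECS["t2.large"]
print(f"t2.large: {spec['vcpus']} vCPUs, {spec['memory_gib']} GiB memory")
```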
AWS S3
- Simple Storage Service (S3) is a redundant data storage service
- Files are called objects in S3, and they are stored in top-level containers called buckets
- Charged per GB-month of data stored in S3
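Since S3 bills per GB-month, a back-of-envelope storage cost estimate is straightforward. The $0.023/GB-month rate below is an assumed standard-tier price for illustration, not a quoted one; check current S3 pricing before relying on it:

```python
# Assumed S3 standard-storage rate, USD per GB-month (illustrative only).
PRICE_PER_GB_MONTH = 0.023

def monthly_cost(gigabytes):
    """Estimated monthly storage cost for the given volume of data."""
    return gigabytes * PRICE_PER_GB_MONTH

print(f"${monthly_cost(500):.2f}/month for 500 GB")  # $11.50/month for 500 GB
```

Requests and data transfer are billed separately, so this only covers storage itself.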
AWS Kinesis
- Kinesis Streams: distributed, fault-tolerant messaging queue; fit for small, high-frequency data
- Kinesis Firehose: writes streaming data directly to S3 and other AWS storage services
- Kinesis Analytics: run SQL on a Kinesis stream
AWS Lambda
- Run code in the cloud without worrying about servers
- Define a function: Java, Node, Python, and now C#
- Define a trigger: file put in S3, data sent to a Kinesis stream, or the Lambda function called directly
- AWS will deploy your code and run it whenever the function is triggered
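A Lambda function is just a handler that AWS invokes with an event. As a minimal sketch of one triggered by a Kinesis stream: record payloads arrive base64-encoded inside the event, and the JSON payload format here is an assumption for illustration:

```python
import base64
import json

def handler(event, context):
    """Minimal Kinesis-triggered Lambda sketch: decode each record's
    base64 payload and collect the parsed JSON rows."""
    rows = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        rows.append(json.loads(payload))
    return {"processed": len(rows), "rows": rows}

# Local smoke test with a hand-built event in the shape AWS delivers:
fake_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(b'{"event_type_code": 5}').decode()}}
]}
print(handler(fake_event, None))  # {'processed': 1, 'rows': [{'event_type_code': 5}]}
```

Deployed behind a trigger, AWS calls this handler for each batch of records; there is no server for you to provision or patch.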
What do we mean by serverless?
Any cloud service where the details and operations of the server are not exposed to the user of the service:
- Lambda
- Kinesis
- DynamoDB
- S3
- Athena
- Not EC2
Where would you use this?
- In the implementation example, administration is a large inhibitor of success and development speed
- Even with Hadoop distributions and support, getting everything installed and configured correctly is a large effort
- Server administration is still a big part of Hadoop administration: OS updates, OS-level security, space concerns (log files get out of hand)
- Support is costly, and the hours spent on administration are costly
- Capacity planning, especially if the cluster is on-site; scaling based on load
- Serverless potentially alleviates these issues
Serverless Example Implementation
1. Records are written to a Kinesis Firehose delivery stream; Firehose batches up the records and puts them in S3
[diagram: Relational Database → Kinesis Firehose delivery stream → S3 bucket]
Serverless Example Implementation
2. A Lambda function triggers on the write to S3, flattens the records, and pushes them to a Kinesis stream
[diagram: S3 bucket → (trigger) Lambda: Flatten → Kinesis stream (flattened rows)]
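The S3 trigger hands the Lambda an event naming the bucket and key that changed. A sketch of pulling those out, following the shape of AWS's S3 notification events; the actual fetch-flatten-push work is elided:

```python
def handler(event, context):
    """S3-triggered Lambda sketch: extract the bucket/key of each new
    object. A real function would fetch the object, flatten its rows,
    and put them to a Kinesis stream."""
    targets = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        targets.append((bucket, key))
    return targets

# Hand-built event in the S3 notification shape (names are hypothetical):
fake_event = {"Records": [
    {"s3": {"bucket": {"name": "raw-events"}, "object": {"key": "batch-0001.csv"}}}
]}
print(handler(fake_event, None))  # [('raw-events', 'batch-0001.csv')]
```

Because Firehose has already batched the records into S3 objects, each invocation processes one batch file rather than individual records.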
Serverless Example Implementation
3. Kinesis Analytics is used to write a sample to S3 and to run the query
[diagram: Kinesis stream → Kinesis Analytics (Sample, Query) → Kinesis Firehose delivery streams → S3 buckets]
Serverless Example Implementation
4. Athena is used for ad-hoc query and analysis
[diagram: Athena over S3]