15-319 / 15-619 Cloud Computing
Recitation 8
October 18, 2016
Overview
- Administrative issues: Office Hours, Piazza guidelines
- Last week's reflection: Project 3.2, OLI Unit 3, Module 13, Quiz 6
- This week's schedule:
  - Quiz 7 - Thursday, October 20th - Unit 4, Module 14
  - Project 3.3 - due October 23
  - Team Project: Phase 1
Last Week: A Reflection
- Content, Unit 3, Module 13:
  - Storage and Network Virtualization
  - Quiz 6 completed
- P3.2: You explored consistency models
  - Sharding and replication
  - Multithreaded programming
  - Implemented the strong consistency model
  - Bonus task: eventual consistency
This Week: Content
UNIT 4: Cloud Storage
- Module 14: Cloud Storage
  - Quiz 7: Introduction to Cloud Storage - Thursday, October 20, 2016
- Module 15: Case Studies: Distributed File Systems
  - Quiz 8: Distributed File Systems Checkpoint
- Module 16: Case Studies: NoSQL Databases
- Module 17: Case Studies: Cloud Object Storage
  - Quiz 9: NoSQL and Object Stores
Project 3.2 Feedback
Please leave us feedback: https://goo.gl/qpz7ou
Project 3 Weekly Modules
- P3.1: Files, SQL and NoSQL
  - Primer: Storage Benchmarking
- P3.2: Replication and Consistency Models
  - Primer: Intro. to Java Multithreading
  - Primer: Thread-safe Programming
  - Primer: Intro. to Consistency Models
- P3.3: Social network with heterogeneous backend storage
Distributed Databases
In 2004, Amazon.com began to run into the limits of traditional storage systems at web scale. Its response was Dynamo (2007), a highly available key-value structured storage system.

Problem                      | Technique used as solution
Data sharding                | Consistent hashing
Transient fault handling     | Sloppy quorum / hinted handoff
Permanent failure recovery   | Anti-entropy using Merkle trees
Membership and health checks | Gossip protocols

Dynamo's techniques are used in S3, DynamoDB, and Cassandra. See Werner Vogels' article on DynamoDB.
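To make the first row of the table concrete, here is a minimal sketch of consistent hashing with virtual nodes, the technique Dynamo uses for data sharding. The node and key names are made up for illustration; real systems tune the hash function and virtual-node count.

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    # Any uniform hash works; MD5 is used here only for illustration.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring,
        # which smooths out the key distribution.
        self._ring = sorted((_hash(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        i = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # deterministically one of the three nodes
```

The point of the ring structure is that adding or removing a node only remaps the keys adjacent to its virtual nodes, instead of reshuffling every key as `hash(key) % n` would.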
Distributed Databases
In 2006, Google published details of its BigTable implementation, designed as a sparse, distributed, multidimensional sorted map. HBase follows this design: it stores members of a column family adjacent to each other on the file system, making it a columnar data store.
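The "sparse, multidimensional sorted map" description can be illustrated with a toy in-memory model: cells are addressed by (row key, column family:qualifier, timestamp), reads return the newest version, and scans return rows in sorted key order. This is only a sketch of the lookup semantics, not of how HBase actually stores data on disk.

```python
class SparseSortedMap:
    """Toy model of the BigTable/HBase data model."""

    def __init__(self):
        # Sparse: only cells that were written exist.
        self._cells = {}  # (row, "family:qualifier") -> {timestamp: value}

    def put(self, row, column, ts, value):
        self._cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column):
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return versions[max(versions)]  # newest timestamp wins

    def scan(self, start_row, stop_row):
        # Rows come back in lexicographic row-key order, as in an HBase scan.
        return sorted({r for r, _ in self._cells if start_row <= r < stop_row})

t = SparseSortedMap()
t.put("u1", "info:name", 1, "alice")
t.put("u1", "info:name", 2, "alicia")
t.get("u1", "info:name")   # -> "alicia" (latest version)
```

Because scans are ordered by row key, row-key design (e.g. putting the most selective component first) is the main tuning knob when you load data into HBase later in the team project.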
Project 3.3 Review
Project 3.3: Introduction
Build a social network about movies.
High Fanout and Multiple Rounds of Data Fetching
A single Facebook page requires many data fetch operations.
Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., ... & Venkataramani, V. (2013, April). Scaling Memcache at Facebook. In NSDI (Vol. 13, pp. 385-398).
P3.3 Data Set
1. User Profiles
   1. User Authentication System (such as a Single-Sign-On, SSO) - RDS MySQL
   2. User Info / Profile - RDS MySQL
   3. Action Log
   4. Social Graph of the User: follower, followee, family, etc. - HBase
2. User Activity System - all user-generated media - MongoDB
3. Big Data Analytics System
   1. Search System
   2. Recommender System
   3. User Behaviour Analysis
Project 3.3: Architecture
Build a social network about movies.
[Architecture diagram: a front-end server and a back-end server, with the back end talking to MySQL (RDS), HBase, MongoDB, and S3.]
MongoDB
- Document database
  - Schema-less model
- Scalable
  - Automatically shards data among multiple servers
  - Does load balancing
- Complex queries
  - MapReduce-style filters and aggregations
  - Geo-spatial queries
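To show what "MapReduce-style filter and aggregation" means, here is a pure-Python sketch of the same computation MongoDB's aggregation pipeline would express with `$match` and `$group` stages. The movie documents are invented for illustration.

```python
from collections import defaultdict

# Invented sample documents, shaped like what you might store in MongoDB.
docs = [
    {"movie": "Heat", "rating": 5},
    {"movie": "Heat", "rating": 4},
    {"movie": "Up",   "rating": 3},
]

def average_ratings(docs, min_rating):
    # Filter stage (like $match): keep documents at or above min_rating.
    # Group stage (like $group): average the ratings per movie.
    totals, counts = defaultdict(int), defaultdict(int)
    for d in docs:
        if d["rating"] >= min_rating:
            totals[d["movie"]] += d["rating"]
            counts[d["movie"]] += 1
    return {m: totals[m] / counts[m] for m in totals}

average_ratings(docs, min_rating=4)   # {'Heat': 4.5}
```

With pymongo, the equivalent would be a two-stage pipeline passed to `collection.aggregate`, which MongoDB can run server-side across shards.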
Project 3.3: Tasks
Build a social network about movies.
Project 3.3: Task 5
Friend recommendation
Twitter Analytics Team Project
Twitter Data Analytics: 15-619 Project
Team Project
- System architecture
- Web server architectures
- Dealing with large-scale, real-world tweet data
- HBase and MySQL optimization
Team Project
- Phase 1: Q1, Q2 (MySQL AND HBase) - CONFIRM YOUR AWS ACCOUNT AND TEAM INFO
- Phase 2: Q1, Q2 & Q3 (MySQL AND HBase)
- Phase 3: Q1, Q2, Q3 & Q4 (MySQL OR HBase)
Team Project Time Table

Phase (and query due)                       | Start                         | Deadline
Phase 1: Q1                                 | Monday 10/10/2016 00:00:01 ET | Sunday 10/23/2016 23:59:59 ET
Phase 1: Q2                                 |                               | Sunday 10/30/2016 23:59:59 ET
Phase 1: Code and Report                    |                               | Tuesday 11/01/2016 23:59:59 ET
Phase 2: Q1, Q2, Q3                         | Monday 10/31/2016 00:00:01 ET | Sunday 11/13/2016 15:59:59 ET
Phase 2 Live Test (HBase/MySQL): Q1, Q2, Q3 | Sunday 11/13/2016 18:00:01 ET | Sunday 11/13/2016 23:59:59 ET
Phase 2: Code and Report                    |                               | Tuesday 11/15/2016 23:59:59 ET
Phase 3: Q1, Q2, Q3, Q4                     | Monday 11/14/2016 00:00:01 ET | Sunday 12/04/2016 15:59:59 ET
Phase 3 Live Test: Q1, Q2, Q3, Q4           | Sunday 12/04/2016 18:00:01 ET | Sunday 12/04/2016 23:59:59 ET
Phase 3: Code and Report                    |                               | Tuesday 12/06/2016 23:59:59 ET

Note: There will be a report due at the end of each phase, where you are expected to discuss optimizations.
WARNING: Check your AWS instance limits on the new account (should be > 10 instances).
Team Project Phase 1
Two queries:
- Q1: Pure front end
- Q2: ETL + back end + front end; do both MySQL (relational DBMS) and HBase (NoSQL)
Grading:
- Submit on TPZ; you will get several numbers: error rate, correctness, and RPS
- Higher RPS, higher correctness, and a lower error rate mean a higher grade
- Q1 is 25% of Phase 1, Q2 MySQL is 25%, Q2 HBase is 25%, and the report is 25%
Team Project, Phase 1, Q1
- Step 1: Compare different front-end frameworks
- Step 2: Deploy the front end
- Step 3: Perform decryption of a secret message
Pure front end, no database needed. You will need to consider scaling horizontally.
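The steps above can be sketched with Python's standard-library HTTP server. The `/q1?key=...` interface and the `handle_query` placeholder are hypothetical; the real request/response format (and the actual decryption task) is specified in the Q1 write-up.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

def handle_query(key: str) -> str:
    # Placeholder for the real Q1 logic (decrypting the secret message);
    # reversing the key is just a stand-in computation for this sketch.
    return key[::-1]

class FrontEnd(BaseHTTPRequestHandler):
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        body = handle_query(qs.get("key", [""])[0]).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve requests (blocks forever):
# HTTPServer(("0.0.0.0", 80), FrontEnd).serve_forever()
```

Because the service is stateless, scaling horizontally is just running more copies behind a load balancer; that is why comparing framework overheads (Step 1) matters for RPS.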
Team Project, Phase 1, Q2
- Step 1: Extract tabular data from raw tweets
  - Input file: JSON tweets (approx. 1 TB)
  - Consider using a MapReduce job for ETL
  - ETL is expensive and there is potential for errors, so plan carefully and test on smaller data sets
  - Start early, or you will have no time to optimize the backend
- Step 2: Load the data into HBase and MySQL (both!)
- Step 3: Deploy a web service that handles HTTP requests and responds with data from the backend
A more optimized backend (MySQL and HBase) means higher throughput and more points. The winner gets grades, fame (?), a job (?).
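Step 1 amounts to a per-line mapper like the sketch below: parse one line of raw tweet JSON and emit a tab-separated row, skipping malformed records instead of crashing the job. The field names follow Twitter's standard JSON layout, but the exact columns you must extract are defined by the Q2 write-up, so treat this as a template.

```python
import json

def tweet_to_row(line: str):
    """Map one raw JSON tweet line to a TSV row, or None if malformed."""
    try:
        t = json.loads(line)
        # Strip tabs/newlines from free text so they can't break the TSV format.
        text = t["text"].replace("\t", " ").replace("\n", " ")
        return "\t".join([str(t["id"]), t["user"]["screen_name"],
                          t["created_at"], text])
    except (ValueError, KeyError):
        return None  # count malformed records, don't crash the ETL job

row = tweet_to_row('{"id": 1, "user": {"screen_name": "bob"}, '
                   '"created_at": "Tue Oct 18 2016", "text": "hi"}')
# row == "1\tbob\tTue Oct 18 2016\thi"
```

In a MapReduce job, this function would be the map step run over the ~1 TB input; testing it locally on a small sample first is far cheaper than debugging on a full EMR cluster.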
Common Q2 Issues
- Unicode: tweets contain text in many scripts (e.g. "cloud computing" in Arabic, Hindi, Chinese 云计算, Japanese クラウドコンピューティング, Kannada, Yiddish, and Russian) as well as emojis
- Remember to eliminate short URLs
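One way to handle both issues at once: Python 3 strings are Unicode, so non-Latin scripts and emojis pass through untouched, while a regex strips the short URLs. The `t.co` pattern below is an assumption for illustration; the write-up's exact cleaning rules take precedence.

```python
import re

# Assumed pattern: Twitter's t.co link shortener. Adjust to match the
# write-up's definition of which URLs must be eliminated.
SHORT_URL = re.compile(r"https?://t\.co/\S+")

def clean_text(text: str) -> str:
    """Remove short URLs; leave all Unicode text (including emojis) intact."""
    return SHORT_URL.sub("", text).strip()

clean_text("云计算 is fun https://t.co/abc123")   # -> "云计算 is fun"
```

A common Q2 bug is cleaning text after byte-level processing has already mangled multi-byte characters; decode input as UTF-8 once at the boundary and work with `str` from then on.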
Hints
- Read the write-up carefully (more than once)
- You can test only once you have a front end
- ETL has many corner cases and can be time-consuming and expensive
- Start early (from the first day); your backend will be meaningless if you have incorrect data
- The reference server and the reference ETL file are your friends
- This big data challenge will easily eat up your time and money if you are careless. Think, calculate, and test before you launch an EMR cluster with 20 machines
Reminders
- Changes in the Team Project write-up: refer to @1616
- Updated banned word list: refer to @1729
- You have a total budget of $50 for Phase 1
- Your system should not cost more than $0.95 per hour; this includes (see write-ups for details):
  - EC2 on-demand instance cost (even if you use spot instances, we will calculate your cost using the on-demand price)
  - EBS cost
  - ELB cost
- Target: Q2 at 3,000 RPS (for both MySQL and HBase)
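Before launching, it is worth doing the $0.95/hour arithmetic explicitly. A minimal sketch, with invented EBS and ELB rates (check the current AWS price tables; remember that spot usage is still graded at the on-demand price):

```python
def hourly_cost(instances, ebs_gb=0.0, elb=False):
    """Estimate cluster cost per hour against the $0.95 budget line.

    instances: list of (on_demand_price_per_hour, count) pairs.
    EBS and ELB rates below are illustrative assumptions, not real prices.
    """
    ec2 = sum(price * n for price, n in instances)
    ebs = ebs_gb * 0.10 / 730            # assume $0.10 per GB-month, 730 h/month
    lb = 0.025 if elb else 0.0           # assumed ELB hourly rate
    return ec2 + ebs + lb

# Hypothetical cluster: 8 instances at $0.10/h, 200 GB EBS, one ELB.
cost = hourly_cost([(0.10, 8)], ebs_gb=200, elb=True)
cost < 0.95   # True for this configuration (~$0.85/h)
```

Running this kind of estimate before every cluster launch is how you avoid discovering at grading time that your per-hour cost disqualifies your RPS numbers.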
Start early! Team Project Q1 is also due Sunday.
Upcoming Deadlines
- Quiz 7: Unit 4, Module 14, Cloud Storage
  Due: Thursday 10/20/2016, 11:59 PM Pittsburgh
- Project 3.3: Social Networking Timeline with Heterogeneous Backends
  Due: 10/23/2016, 11:59 PM Pittsburgh
- Team Project: Phase 1, Query 1 (this Sunday, Oct 23!)
  Due: 10/23/2016, 11:59 PM Pittsburgh
- Team Project: Phase 1, Query 2
  Due: 10/30/2016, 11:59 PM Pittsburgh
Q&A