15-319 / 15-619 Cloud Computing. Recitation 8 October 18, 2016


15-319 / 15-619 Cloud Computing Recitation 8 October 18, 2016

Overview
- Administrative issues
  - Office Hours, Piazza guidelines
- Last week's reflection
  - Project 3.2, OLI Unit 3, Module 13, Quiz 6
- This week's schedule
  - Quiz 7 - Thursday, October 20th
  - Unit 4, Module 14
  - Project 3.3 - October 23
- Team Project: Phase 1

Last Week: A Reflection
- Content, Unit 3 - Module 13:
  - Storage and Network Virtualization
  - Quiz 6 completed
- P3.2: You explored consistency models
  - Sharding and Replication
  - Multithreaded programming
  - Implemented Strong consistency model
  - Bonus Task: Eventual Consistency

This Week: Content
UNIT 4: Cloud Storage
- Module 14: Cloud Storage
  - Quiz 7 - Introduction to Cloud Storage: Thursday, October 20, 2016
- Module 15: Case Studies: Distributed File Systems
  - Quiz 8: Distributed File Systems Checkpoint
- Module 16: Case Studies: NoSQL Databases
- Module 17: Case Studies: Cloud Object Storage
  - Quiz 9: NoSQL and Object Stores

Project 3.2 Feedback
https://goo.gl/qpz7ou
Please leave us feedback

Project 3 Weekly Modules
- P3.1: Files, SQL and NoSQL
  - Primer: Storage Benchmarking
- P3.2: Replication and Consistency models
  - Primer: Intro. to Java Multithreading
  - Primer: Thread-safe programming
  - Primer: Intro. to Consistency Models
- P3.3: Social network with heterogeneous backend storage

Distributed Databases
- In 2004, Amazon.com began to experience the limits of scale on a traditional web-scale system
- The response was a highly available key-value structured storage system called Dynamo (2007)

Problem                         Technique used as solution
Data Sharding                   Consistent Hashing
Transient Fault Handling        Sloppy Quorum / Hinted Handoff
Permanent Failure Recovery      Anti-entropy using Merkle trees
Membership and Health Checks    Gossip protocols

- Used in S3, DynamoDB, Cassandra
- Article on DynamoDB - By Werner Vogels
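The slide pairs data sharding with consistent hashing. As a rough illustration only, the sketch below builds a toy hash ring in Python. The class name `ConsistentHashRing`, the node names, the use of MD5 and the choice of 64 virtual nodes per server are all assumptions made for this example; this is not Dynamo's actual implementation.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual nodes per physical node (assumed value)
        self._ring = []        # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(key):
        # MD5 is used only to spread keys evenly around the ring, not for security.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node is placed at several positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key):
        """Return the first node found clockwise from the key's position."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (self._position(key), ""))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical usage: three storage nodes, then one of them leaves.
ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner_before = ring.lookup("user:42")
ring.remove_node("node-c")
owner_after = ring.lookup("user:42")
```

The property that matters for sharding is that removing a node only moves the keys that node owned; every other key keeps its owner, which is what keeps rebalancing cheap.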

Distributed Databases
- In 2006, Google published details about their implementation of BigTable
- Designed as a sparse, distributed, multidimensional sorted map
- HBase stores members of column families adjacent to each other on the file system: a columnar data store
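To make "sparse, multidimensional sorted map" more concrete, here is a small in-memory sketch in Python. It is not the BigTable or HBase API: the class `SparseSortedTable`, its methods and the `family:qualifier` column strings are hypothetical, and sorting by row first and then by column is a simplification of how real systems lay out column families on disk.

```python
import bisect


class SparseSortedTable:
    """Toy (row, column, timestamp) -> value map kept in sorted key order."""

    def __init__(self):
        self._keys = []    # sorted (row, column, -timestamp) tuples
        self._cells = {}   # key tuple -> value; absent cells take no space

    def put(self, row, column, timestamp, value):
        # Negating the timestamp makes the newest version sort first.
        key = (row, column, -timestamp)
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the newest value of a cell, or None if it was never written."""
        idx = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if idx < len(self._keys) and self._keys[idx][:2] == (row, column):
            return self._cells[self._keys[idx]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        idx = bisect.bisect_left(self._keys, (start_row,))
        while idx < len(self._keys) and self._keys[idx][0] < stop_row:
            row, column, neg_ts = self._keys[idx]
            yield row, column, -neg_ts, self._cells[self._keys[idx]]
            idx += 1


# Hypothetical usage: rows need not share the same columns (sparse).
table = SparseSortedTable()
table.put("user1", "info:name", 1, "Alice")
table.put("user1", "follows:user2", 1, "1")
table.put("user2", "info:name", 1, "Bob")
```

Because keys are kept sorted, cells of one row sit next to each other, and within a row the columns of one family (same prefix before the colon) are adjacent, which is what makes range scans cheap.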

Project 3.3 Review

Project 3.3: Introduction
Build a social network about movies.

High Fanout and Multiple Rounds of Data Fetching
A single Facebook page requires many data fetch operations.
Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., ... & Venkataramani, V. (2013, April). Scaling Memcache at Facebook. In NSDI (Vol. 13, pp. 385-398).
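A minimal sketch of what "high fanout" and "multiple rounds" can look like in code, assuming a hypothetical `fetch(key)` function that stands in for a call to any backend store. The fake in-memory store, the key layout and the function names are invented for this example and are not taken from the paper or from the project write-up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the real backends.
FAKE_STORE = {
    ("profile", "alice"): {"name": "Alice", "friends": ["bob", "carol"]},
    ("profile", "bob"): {"name": "Bob", "friends": ["alice"]},
    ("profile", "carol"): {"name": "Carol", "friends": ["alice"]},
    ("posts", "bob"): ["Liked Inception"],
    ("posts", "carol"): [],
}


def fake_fetch(key):
    return FAKE_STORE[key]


def fetch_all(fetch, keys, max_workers=8):
    """Fan out: issue all fetches of one round concurrently."""
    keys = list(keys)
    if not keys:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(keys, pool.map(fetch, keys)))


def build_timeline(user_id, fetch):
    # Round 1: the friend list is unknown until the profile has been fetched.
    profile = fetch(("profile", user_id))
    friend_ids = profile["friends"]
    # Round 2: fan out over every friend's profile and posts at once.
    keys = [("profile", f) for f in friend_ids] + [("posts", f) for f in friend_ids]
    data = fetch_all(fetch, keys)
    return [
        {"friend": data[("profile", f)]["name"], "posts": data[("posts", f)]}
        for f in friend_ids
    ]


timeline = build_timeline("alice", fake_fetch)
```

The second round cannot start before the first finishes, so page latency grows with the number of dependent rounds, while the fetches inside one round can overlap.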

P3.3 Data Set
1. User Profiles
   1. User Authentication System (such as a Single-Sign-On or SSO) - RDS MySQL
   2. User Info / Profile - RDS MySQL
   3. Action Log
   4. Social Graph of the User: follower, followee, family etc. - HBase
2. User Activity System - All user generated media - MongoDB
3. Big Data Analytics System
   1. Search System
   2. Recommender System
   3. User Behaviour Analysis

Project 3.3: Architecture
Build a social network about movies.
[Architecture diagram: Front-end Server, Back-end Server, HBase, S3, MongoDB, MySQL (RDS)]

MongoDB
- Document Database
  - Schema-less model
- Scalable
  - Automatically shards data among multiple servers
  - Does load-balancing
- Complex Queries
  - MapReduce style filter and aggregations
  - Geo-spatial queries

Project 3.3: Tasks
Build a social network about movies.

Project 3.3: Task 5
Friend recommendation

Twitter Analytics Team Project

twitter DATA ANALYTICS: 15619 PROJECT

Team Project
- System Architecture
- Web server architectures
- Dealing with large scale real world tweet data
- HBase and MySQL optimization

Team Project
- Phase 1: Q1, Q2 (MySQL AND HBase)
  - CONFIRM YOUR AWS ACCOUNT AND TEAM INFO
- Phase 2: Q1, Q2 & Q3 (MySQL AND HBase)
- Phase 3: Q1, Q2, Q3 & Q4 (MySQL OR HBase)

Team Project Time Table

Phase 1
- Q1: start Monday 10/10/2016 00:00:01 ET, deadline Sunday 10/23/2016 23:59:59 ET
- Q2: deadline Sunday 10/30/2016 23:59:59 ET
- Code and report due: Tuesday 11/01/2016 23:59:59 ET

Phase 2
- Q1, Q2, Q3: start Monday 10/31/2016 00:00:01 ET, deadline Sunday 11/13/2016 15:59:59 ET
- Live Test (HBase/MySQL), Q1, Q2, Q3: start Sunday 11/13/2016 18:00:01 ET, deadline Sunday 11/13/2016 23:59:59 ET
- Code and report due: Tuesday 11/15/2016 23:59:59 ET

Phase 3
- Q1, Q2, Q3, Q4: start Monday 11/14/2016 00:00:01 ET, deadline Sunday 12/04/2016 15:59:59 ET
- Live Test, Q1, Q2, Q3, Q4: start Sunday 12/04/2016 18:00:01 ET, deadline Sunday 12/04/2016 23:59:59 ET
- Code and report due: Tuesday 12/06/2016 23:59:59 ET

Note: There will be a report due at the end of each phase, where you are expected to discuss optimizations.
WARNING: Check your AWS instance limits on the new account (should be > 10 instances).

Team Project Phase 1
- Two queries
  - Q1: Pure front end
  - Q2: ETL + back end + front end; do both MySQL (relational DBMS) and HBase (NoSQL)
- Grading
  - Submit on TPZ; you will get several numbers: Error Rate, Correctness and RPS
  - Higher RPS, higher correctness and lower error rate mean a higher grade
  - Q1 is 25% of phase 1, Q2 MySQL is 25% of phase 1, Q2 HBase is 25% of phase 1, report is 25% of phase 1

Team Project, Phase 1, Q1
- Step 1: Compare different front-end frameworks
- Step 2: Deploy the front-end
- Step 3: Perform decryption of a secret message
- Pure front end, no database needed
- Need to consider scaling horizontally

Team Project, Phase 1, Q2
- Step 1: Extract tabular data from raw tweets
  - Input file: JSON Tweets (approx. 1 TB)
  - Consider using a MapReduce job for ETL
  - ETL is expensive and there's the potential for errors, so plan carefully and test on smaller data sets
  - Start early, or there will be no time to optimize the backend
- Step 2: Load the data into HBase and MySQL (both!)
- Step 3: Deploy a web service that handles HTTP requests and responds with data from the backend
- An optimized backend (MySQL and HBase): higher throughput = more points
- Winner gets grades, fame (?), job (?)
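Step 1 could be sketched as a mapper-style function that turns one JSON tweet per line into one tab-separated row and drops lines it cannot use. This is only a sketch: the choice of fields (`id_str`, `user.id_str`, `created_at`, `text`) and the output layout are assumptions for illustration, and the project write-up defines what is actually required.

```python
import json


def tweet_to_row(line):
    """Convert one JSON tweet line to a tab-separated row, or None if unusable."""
    line = line.strip()
    if not line:
        return None
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError:
        return None  # malformed input is a common corner case
    if not isinstance(tweet, dict):
        return None
    user = tweet.get("user")
    if not isinstance(user, dict):
        return None
    fields = (
        tweet.get("id_str"),
        user.get("id_str"),
        tweet.get("created_at"),
        tweet.get("text"),
    )
    if not all(isinstance(f, str) for f in fields):
        return None
    tweet_id, user_id, created_at, text = fields
    # JSON-encode the text so embedded tabs and newlines cannot break the row,
    # and keep non-ASCII characters readable instead of \u escapes.
    return "\t".join([tweet_id, user_id, created_at, json.dumps(text, ensure_ascii=False)])


def rows_from_lines(lines):
    """Yield rows for the usable tweets in an iterable of input lines."""
    for line in lines:
        row = tweet_to_row(line)
        if row is not None:
            yield row
```

In a streaming MapReduce job the same function would be applied to each line of standard input; running it first on a few thousand lines locally is a cheap way to find corner cases before paying for a cluster.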

Common Q2 issues
- Unicode
  - اﻟﺣوﺳﺑﺔ اﻟﺳﺣﺎﺑﯾﺔ
  - ब दल क य ट ग
  - 云计算
  - クラウドコンピューティング
  - ಕ ಪ
  - וואָלקן קאַמפּיוטינ ג
  - облачных вычислений
- Emojis
- Remember to eliminate short URLs
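The issues above can be probed with a few small Python helpers. The regular expression assumes that "short URLs" means Twitter's `t.co` links, which is a guess for illustration; the write-up gives the real rule. One related fact worth knowing is that MySQL's legacy `utf8` character set stores at most three bytes per character, so emojis need `utf8mb4`.

```python
import re

# Assumption: short URLs are t.co links; check the write-up for the real rule.
SHORT_URL = re.compile(r"https?://t\.co/[A-Za-z0-9]+")


def strip_short_urls(text):
    """Remove every t.co link from the text, leaving the rest untouched."""
    return SHORT_URL.sub("", text)


def utf8_len(text):
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def needs_utf8mb4(text):
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


sample = "云计算 https://t.co/abc123 😀"
cleaned = strip_short_urls(sample)
```

Character count and byte count differ for most of the scripts listed on the slide, so any length check or column size should say which one it means.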

Hints
- Read the write-up carefully (read more than once)
- You can test only if you have a front end
- ETL has many corner cases, can be time consuming and expensive
- Start early (from the first day); your backend will be meaningless if you have incorrect data
- The reference server and the reference ETL file are your friends
- Big data challenge will easily eat up your time and money if you are careless. Think, calculate, & test before you launch an EMR cluster with 20 machines

Reminder
- Changes in Team Project writeup. Refer @1616
- Updated banned word list. Refer @1729
- You have a total budget of $50 for Phase 1
- Your system should not cost more than $0.95 per hour; this includes (see write-ups for details):
  - EC2 on-demand instance cost (even if you use spot instances, we will calculate your cost using the on-demand instance price)
  - EBS cost
  - ELB cost
- Target: Q2 - 3000 rps (for both MySQL and HBase)
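The "think, calculate" advice can be turned into a tiny budget check. The two limits come from the slide; the instance and ELB prices used in the usage lines are hypothetical placeholders, not real AWS prices, so substitute the on-demand prices from the write-up or the AWS price list.

```python
HOURLY_LIMIT = 0.95   # USD per hour, from the slide
PHASE_BUDGET = 50.0   # USD for Phase 1, from the slide


def hourly_cost(instance_count, instance_price, ebs_hourly=0.0, elb_hourly=0.0):
    """Hourly system cost: EC2 at on-demand price plus EBS and ELB."""
    return instance_count * instance_price + ebs_hourly + elb_hourly


def within_hourly_limit(cost):
    # Small tolerance so floating-point rounding does not reject an exact match.
    return cost <= HOURLY_LIMIT + 1e-9


def max_hours(cost):
    """Hours the phase budget would last at a given hourly cost."""
    return PHASE_BUDGET / cost


# Hypothetical prices: 0.10 USD/hour per instance, 0.025 USD/hour for the ELB.
cost = hourly_cost(8, 0.10, elb_hourly=0.025)
allowed = within_hourly_limit(cost)
hours = max_hours(cost)
```

Even at exactly $0.95 per hour, $50 lasts only about 52.6 hours (50 / 0.95), so long-running test deployments use up the budget quickly.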

Start early! Team Project Q1 Also Due Sunday

Upcoming Deadlines
- Quiz 7: Unit 4 - Module 14 - Cloud Storage
  - Due: Thursday, 10/20/2016 11:59PM Pittsburgh
- Project 3.3: Social Networking Timeline with Heterogeneous Backends
  - Due: 10/23/2016 11:59PM Pittsburgh
- Team Project: Phase 1 - Query 1 (This Sunday, Oct 23!)
  - Due: 10/23/2016 11:59PM Pittsburgh
- Team Project: Phase 1 - Query 2
  - Due: 10/30/2016 11:59PM Pittsburgh

Q&A