/ Cloud Computing. Recitation 10 March 22nd, 2016

Similar documents
/ Cloud Computing. Recitation 8 October 18, 2016

/ Cloud Computing. Recitation 9 March 15th, 2016

/ Cloud Computing. Recitation 7 October 10, 2017

/ Cloud Computing. Recitation 9 March 17th and 19th, 2015

/ Cloud Computing. Recitation 13 April 12 th 2016

CS / Cloud Computing. Recitation 11 November 5 th and Nov 8 th, 2013

/ Cloud Computing. Recitation 8 March 1 st, 2016

/ Cloud Computing. Recitation 6 October 2 nd, 2018

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

/ Cloud Computing. Recitation 13 April 14 th 2015

CS / Cloud Computing. Recitation 7 October 7 th and 9 th, 2014

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL

/ Cloud Computing. Recitation 5 February 14th, 2017

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 7 February 24th & 26th, 2015

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014

CS / Cloud Computing. Recitation 9 October 22 nd and 25 th, 2013

CISC 7610 Lecture 2b The beginnings of NoSQL

CIB Session 12th NoSQL Databases Structures

1

/ Cloud Computing. Recitation 5 September 26 th, 2017

/ Cloud Computing. Recitation 13 April 17th 2018

Challenges for Data Driven Systems

1

/ Cloud Computing. Recitation 5 September 27 th, 2016

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

2

Big Data Analytics using Apache Hadoop and Spark with Scala

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

CS 655 Advanced Topics in Distributed Systems

/ Cloud Computing. Recitation 15 December 6 th 2016

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

Big Data Development CASSANDRA NoSQL Training - Workshop. November 20 to (5 days) 9 am to 5 pm HOTEL DUBAI GRAND DUBAI

A Fast and High Throughput SQL Query System for Big Data

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Introduction to NoSQL Databases

QLIK INTEGRATION WITH AMAZON REDSHIFT

Ghislain Fourny. Big Data 5. Column stores

@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS

Dynamo: Amazon s Highly Available Key-Value Store

CS / Cloud Computing. Recitation 8 October 14 th and 16 th, 2014

Cassandra Design Patterns

Big Data Hadoop Course Content

Chapter 24 NOSQL Databases and Big Data Storage Systems

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Introduction to Graph Databases

Why NoSQL? Why Riak?

Stages of Data Processing

Getting to know. by Michelle Darling August 2013

Cloud Analytics and Business Intelligence on AWS

FAQs Snapshots and locks Vector Clock

6.830 Lecture Spark 11/15/2017

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

Database Evolution. DB NoSQL Linked Open Data. L. Vigliano

Data-Intensive Distributed Computing

Using space-filling curves for multidimensional

Ghislain Fourny. Big Data 5. Wide column stores

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Non-Relational Databases. Pelle Jakovits

/ Cloud Computing. Recitation 2 January 19 & 21, 2016

10. Replication. Motivation

Column-Family Databases Cassandra and HBase

Presented by Sunnie S Chung CIS 612

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Evolving To The Big Data Warehouse

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Scaling for Humongous amounts of data with MongoDB

CS 6343: CLOUD COMPUTING Term Project

Exploring Cassandra and HBase with BigTable Model

DATABASE DESIGN II - 1DL400

Big Data Architect.

SQL, NoSQL, MongoDB. CSE-291 (Cloud Computing) Fall 2016 Gregory Kesden

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

Sources. P. J. Sadalage, M Fowler, NoSQL Distilled, Addison Wesley

Lecture 25 Overview. Last Lecture Query optimisation/query execution strategies

Review - Relational Model Concepts

Study of NoSQL Database Along With Security Comparison

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

September 2013 Alberto Abelló & Oscar Romero 1

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Cassandra- A Distributed Database

MongoDB Schema Design

A Non-Relational Storage Analysis

DEMYSTIFYING BIG DATA WITH RIAK USE CASES. Martin Schneider Basho Technologies!

ARCHITECTING WEB APPLICATIONS FOR THE CLOUD: DESIGN PRINCIPLES AND PRACTICAL GUIDANCE FOR AWS

Recap. CSE 486/586 Distributed Systems Case Study: Amazon Dynamo. Amazon Dynamo. Amazon Dynamo. Necessary Pieces? Overview of Key Design Techniques

Outline. Introduction Background Use Cases Data Model & Query Language Architecture Conclusion

CompSci 516 Database Systems

Rails on HBase. Zachary Pinter and Tony Hillerson RailsConf 2011

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

ITP 342 Mobile App Development. APIs

ZHT A Fast, Reliable and Scalable Zero- hop Distributed Hash Table

Transcription:

15-319 / 15-619 Cloud Computing Recitation 10 March 22nd, 2016

Overview Administrative issues Office Hours, Piazza guidelines Last week s reflection Project 3.3, OLI Unit 4, Module 15, Quiz 8 This week s schedule 15619 Project Phase 2- March 30th Quiz 9 March 25th (Unit 4, Module 16 & 17) Project 3.4 March 27th 2

Reminders Monitor AWS expenses regularly and tag all resources Check your bill (Cost Explorer > filter by tags). Piazza Guidelines Please tag your questions appropriately Search for an existing answer first Provide clean, modular and well documented code Large penalties for not doing so. Utilize Office Hours We are here to help (but not to give solutions) Use the team AWS account and tag the 15619Project resources carefully. 3

Reflection on P3.2 and P3.3 Feedback on code submission While grading P3.2 & P3.3 this week, we have found that many students are not properly implementing thread-safe access to shared data structures. Review @2129 4

OLI Unit 4 Review

OLI Unit 4 : Review M14 : Cloud Storage Overview M15 : Distributed File Systems (HDFS & Ceph) M16 : NoSQL Database Case Studies HBase, MongoDB, Cassandra, DynamoDB M17: Cloud Object Storage 6

Distributed Databases In 2004, Amazon.com began to experience the limits of scale on a traditional web-scale system Response was a highly available key-value structured storage system called Dynamo (2007) Problem Technique used as solution Data Sharding Consistent Hashing Transient Fault Handling Sloppy Quorum / Hinted Handoff Permanent Failure Recovery Anti-entropy using Merkle trees Membership and Health Checks Gossip protocols Used in S3, DynamoDB, Cassandra 7

Distributed Databases In 2006, Google published details about their implementation of BigTable Designed as a sparse, distributed multidimensional sorted map HBase stores members of column families adjacent to each other on the file system columnar data store 8

Upcoming Deadlines Quiz 9 : Unit 4 - Modules 16, 17 Due : 3/25/2016 11:59 PM Pittsburgh Project 3.4 : Social Network with Heterogeneous DBs Due : 3/27/2016 11:59 PM Pittsburgh 15619Project : Phase 2 Due : 3/30/2016 3:59 PM Pittsburgh

Project 3 Review

Project 3 Weekly Modules P3.1: Files, SQL and NoSQL P3.2: Sharding and Replication P3.3: Consistency P3.4: Social network with heterogeneous backends P3.5: Data warehousing and OLAP

Project 3.4 : Introduction Build a social network about movies: 12

High Fanout and Multiple Rounds of Data Fetching A single Facebook page, requires many data fetch operations Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C.,... & Venkataramani, V. (2013, April). Scaling Memcache at Facebook. In nsdi (Vol. 13, pp. 385-398).

P3.4 Data Set 1. User Profiles 1. 2. User Authentication System (such as a Single-Sign-On or SSO) - RDS MySQL 2. User Info / Profile - RDS MySQL 3. Action Log 4. Social Graph of the User: follower, followee, family etc. HBase User Activity System - All user generated media - MongoDB 3. Big Data Analytics System 1. 2. 3. Search System Recommender System User Behaviour Analysis

Project 3.4 : Architecture Build a social network about movies: HBase Front-end Server Back-end Server S3 MySQL (RDS) MongoDB 15

MongoDB Document Database Schema-less model Scalable Automatically shards data among multiple servers Does load-balancing Complex Queries MapReduce style filter and aggregations Geo-spatial queries

Heterogeneous Backends

Project 3.4 : Tasks Build a social network about movies: 18

Project 3.4 : Bonus Task Friend recommendation 19

Upcoming Deadlines Quiz 9 : Unit 4 - Modules 16, 17 Due : 3/25/2016 11:59 PM Pittsburgh Project 3.4 : Social Network with Heterogeneous DBs Due : 3/27/2016 11:59 PM Pittsburgh 15619Project : Phase 2 Due : 3/30/2016 3:59 PM Pittsburgh

twitter DATA ANALYTICS: 15619 PROJECT

Phase 1 Leaderboard Well done!!! Congratulations skyleg & Sugoyi

15619 Project Time Table Phase (and query due) Start Deadline Code and Report Due Phase 1 Part 1 Q1, Q2 Thursday 02/25/2016 00:00:01 EST Wednesday 03/16/2016 23:59:59 EDT Thursday 03/17/2016 23:59:59 EDT Phase 2 Q1, Q2, Q3 Thursday 03/17/2016 00:00:01 EDT Wednesday 03/30/2016 15:59:59 EDT Phase 2 Live Test (Hbase/MySQL) Q1, Q2, Q3 Wednesday 03/30/2016 18:00:01 EDT Wednesday 03/30/2016 23:59:59 EDT Phase 3 Q1, Q2, Q3, Q4 Thursday 03/31/2016 00:00:01 EDT Wednesday 04/13/2016 15:59:59 EDT Phase 3 Live Test Q1, Q2, Q3, Q4 Wednesday 04/13/2016 18:00:01 EDT Wednesday 04/13/2016 23:59:59 EDT Thursday 03/31/2016 23:59:59 EDT Thursday 04/13/2016 23:59:59 EDT Note: There will be a report due at the end of each phase, where you are expected to discuss design, exploration and optimizations.

15619 Project Phase 2 Deadlines 15619 Project Phase1 15619 Project Phase 2 Code & Report are Due Live test (6pm EDT) Phase 1 Due 03/16/2016 23:59:59 EDT Phase 1 Report Due Phase 2 Start 03/17/2016 23:59:59 EDT Phase 2 Due 03/30/2016 15:59:59 EDT 15619 Project Phase 2 Q1, Q2 & Q3 Phase 2 Report Due Phase 3 Start 03/31/2015 23:59:59 EDT

Phase 2 One new query (Q1, Q2 and Q3) More ETL Multiple tables and queries Live Test!!! Both HBase and MySQL Two DNS Includes Mixed-Load

Hints - MySQL System Environment Storage Medium Storage Engine Character set Import data (SHOW WARNINGS) Indexing Profiling/Optimization EXPLAIN SET PROFILING=1 htop, iotop

Hints - HBase Loading data: Pig, thrift, MapReduce HBase schema: GET is much faster than SCAN How to design rowkey? HBase cluster: Cloudera Manager - easy deployment and management of cluster Deploy your own HBase cluster and automate it Using EMR will lead to higher cost must use less instances <$.85 HBase configuration tuning: Heap size, cache size Block size Region size number The book: http://shop.oreilly.com/product/0636920014348.do

General Tips Don t blindly optimize for every component, identify the bottlenecks using fine-grained profiling Use caches wisely: cache in HBase and MySQL is obviously important, but front-end cache will most likely to fail during the Live test Get the whole picture of the database you are using, don t just Google and adopt HBase/MySQL optimization techniques blindly. Review what we have learned in previous project modules Scale out Load balancing Replication and sharding Look at the feedback of your Phase 1 report!

Q3 : Calculate word occurrences in tweet text within a certain user id range and a date range. (Two-range query) Request Format GET/q3?start_date=yyyy-mm-dd&end_date=yyyy-mmdd&start_userid=uid&end_userid=uid&words=w1,w2,w3 Response Format TEAMID,TEAM_AWS_ACCOUNT_ID\n w1:count1\n w2:count2\n W3:count3\n Target RPS 6000

Q3 : Request Example (Double Range Query) GET/q3?start_date=2014-04-01&end_date=2014-0528&start_userid=51538630&end_userid=51539182&words=u,petition,loving Response Format Team,1234-5678-1234 u:7\n petition:2\n loving:5\n

Q3: ETL 1. Filter out non-english tweets (lang!= en ) 2. Split words when a non-alphanumeric character ([^a-za-z0-9]) is encountered. Use the regular expression provided in the write-up. 2. Words are case INSENSITIVE in word count. 3. Banned words in Q2 will not appear in Q3 requests. 4. Ignore words from stop words list. 5. Match your ETL result with the reference file provided. We will provide a Q3 reference file and reference server.

Q3 Hints EValuate the Q3 functionality when designing your schema, especially for HBase. Your schema design should avoid a big scan. Try to get an idea of the size of user id ranges and date ranges in the requests.

Phase 2 Live Test HBase LiveTest Time Value Target Weight 6:00 pm - 6:30 pm Warm-up (Q1 only) - 0% 6:30 pm - 7:00 pm Q1 27000 5% 7:00 pm - 7:30 pm Q2 10000 10% 7:30 pm - 8:00 pm Q3 6000 10% 8:00 pm - 8:30 pm Mixed Reads(Q1,Q2, Q3) TBD 5+5+5 = 15% Half Hour Break MySQL LiveTest Time Value Target Weight

Tips for Phase 2 Carefully design and check ETL process Watch your budget: $60 = Phase 2 + Live Test Preparing for the live test You are required to submit two URLs, one for the MySQL live test and one for the HBase for live test Budget limited to $.85/hr for MySQL and HBase web service separately. Need to have all Q1-Q3 running at the same time. Don t expect testing in sequence. Queries will be mixed.

Upcoming Deadlines Quiz 9 : Unit 4 - Modules 16, 17 Due: 3/25/2016 11:59 PM Pittsburgh Project 3.4 : Social Network with Heterogeneous DBs Due: 3/27/2016 11:59 PM Pittsburgh 15619Project : Phase 2 Due: 3/30/2016 3:59 PM Pittsburgh