1

Similar documents
/ Cloud Computing. Recitation 6 October 2 nd, 2018

1

/ Cloud Computing. Recitation 7 October 10, 2017

CS / Cloud Computing. Recitation 11 November 5 th and Nov 8 th, 2013

/ Cloud Computing. Recitation 8 October 18, 2016

CS / Cloud Computing. Recitation 7 October 7 th and 9 th, 2014

/ Cloud Computing. Recitation 13 April 12 th 2016

/ Cloud Computing. Recitation 2 September 5 & 7, 2017

/ Cloud Computing. Recitation 9 March 17th and 19th, 2015

/ Cloud Computing. Recitation 7 February 24th & 26th, 2015

/ Cloud Computing. Recitation 9 March 15th, 2016

Introduction to NoSQL Databases

/ Cloud Computing. Recitation 10 March 22nd, 2016

CS / Cloud Computing. Recitation 9 October 22 nd and 25 th, 2013

/ Cloud Computing. Recitation 2 January 19 & 21, 2016

/ Cloud Computing. Recitation 5 February 14th, 2017

CIB Session 12th NoSQL Databases Structures

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

A NoSQL Introduction for Relational Database Developers. Andrew Karcher Las Vegas SQL Saturday September 12th, 2015

NoSQL Databases Analysis

<Insert Picture Here> MySQL Cluster What are we working on

/ Cloud Computing. Recitation 13 April 14 th 2015

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

SQLite vs. MongoDB for Big Data

Introduction to Database Services

/ Cloud Computing. Recitation 5 September 27 th, 2016

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

CSE 344 JULY 9 TH NOSQL

Big Data Hadoop Course Content

Distributed Systems. 29. Distributed Caching Paul Krzyzanowski. Rutgers University. Fall 2014

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

Scott Meder Senior Regional Sales Manager

SCALABLE WEB PROGRAMMING. CS193S - Jan Jannink - 2/02/10

/ Cloud Computing. Recitation 5 September 26 th, 2017

STATE OF MODERN APPLICATIONS IN THE CLOUD

CSE 530A. Non-Relational Databases. Washington University Fall 2013

/ Cloud Computing. Recitation 8 March 1 st, 2016

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

An Introduction to Big Data Formats

DATABASE DESIGN II - 1DL400

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

ARCHITECTING WEB APPLICATIONS FOR THE CLOUD: DESIGN PRINCIPLES AND PRACTICAL GUIDANCE FOR AWS

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

Strategies for Incremental Updates on Hive

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect

Distributed Data Store

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Cloud Computing & Visualization

Informatica Power Center 10.1 Developer Training

Prototyping Data Intensive Apps: TrendingTopics.org

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Tour of Database Platforms as a Service. June 2016 Warner Chaves Christo Kutrovsky Solutions Architect

COMP9321 Web Application Engineering

Advanced Database Project: Document Stores and MongoDB

Relational databases

Part 1: Indexes for Big Data

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414

Database Acceleration Solution Using FPGAs and Integrated Flash Storage

Key Value Store. Yiding Wang, Zhaoxiong Yang

Database Systems CSE 414

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Provide Real-Time Data To Financial Applications

Using space-filling curves for multidimensional

Approaching the Petabyte Analytic Database: What I learned

Manual Trigger Sql Server 2008 Examples Insert Update

Getting to know. by Michelle Darling August 2013

Database Architectures

Datacenter replication solution with quasardb

RethinkDB. Niharika Vithala, Deepan Sekar, Aidan Pace, and Chang Xu

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

Etlworks Integrator cloud data integration platform

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

CloudExpo November 2017 Tomer Levi

CIT 668: System Architecture. Amazon Web Services

Lambda Architecture for Batch and Stream Processing. October 2018

Big Data Analytics. Rasoul Karimi

Introduction to NoSQL

Percona Server for MySQL 8.0 Walkthrough

BIG DATA COURSE CONTENT

Migrate from Netezza Workload Migration

What is database? Types and Examples

QLIK INTEGRATION WITH AMAZON REDSHIFT

A Fast and High Throughput SQL Query System for Big Data

Tungsten Replicator for Kafka, Elasticsearch, Cassandra

Index. Bessel function, 51 Big data, 1. Cloud-based version-control system, 226 Containerization, 30 application, 32 virtualize processes, 30 31

A Global In-memory Data System for MySQL Daniel Austin, PayPal Technical Staff

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores

MongoDB Schema Design for. David Murphy MongoDB Practice Manager - Percona

GridGain and Apache Ignite In-Memory Performance with Durability of Disk

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

Amazon AWS-Solution-Architect-Associate Exam

Design Patterns for Large- Scale Data Management. Robert Hodges OSCON 2013

Hive Metadata Caching Proposal

Why NoSQL? Why Riak?

A Glimpse of the Hadoop Echosystem

Transcription:

1

2

3

6

7

8

9

10

Storage & IO Benchmarking Primer Running sysbench and preparing data Use the prepare option to generate the data. Experiments Run sysbench with different storage systems and instance types. Doing this multiple times to reveal different behaviors and results. Compare the requests per second. 11

Scenario Instance Type Storage Type RPS Range RPS Increase Across 3 Iterations 1 t2.micro EBS Magnetic Storage 169.13, 171.81, 181.31 Trivial (< 5%) 2 t2.micro EBS General Purpose SSD 1465.00, 1473.33, 1490.93 Trivial (< 5%) 3 m4.large EBS Magnetic Storage 527.70, 973.63, 1246.67 Significant (can reach ~140% increase with an absolute value of 450-700) 4 m4.large EBS General Purpose SSD 2046.66, 2612.00, 2649.66 Noticeable (can reach ~30% increase with an absolute value of 500-600) What can you conclude from these results? 12

SSD has better performance than magnetic disk m4.large instance has better performance than a t2.micro instance The RPS increase across 3 iterations for m4.large is more significant than that for t2.micro: The reason is an instance with more memory can cache more of the previous requests for repeated tests. Caching is a vital performance tuning mechanism when building high performance applications. 13

P3.1: Files, SQL, and NoSQL: Task 1: analyze data in flat files Linux tools (e.g. grep, awk) Data libraries (e.g. pandas) Task 2: Explore a SQL database (MySQL) load data, run queries, indexing, auditing plain-sql v/s ORM Task 3: Implement a prototype of Redis Task 4: Explore a NoSQL DB (HBase) load data, run basic queries The NoSQL and HBase primers are vital for P3.1 14

Flat files, plain text or binary comma-separated values (CSV) format: Carnegie,Cloud Computing,A,2018 tab-separated values (TSV) format: Carnegie\tCloud Computing\tA\t2018 a custom and verbose format: Name: Carnegie, Course: Cloud Computing, Section: A, Year: 2018 15

Lightweight, Flexible, in favor of small tasks Run it once and throw it away Performing complicated analysis on data in files can be inconvenient Usually flat files should be fixed or append-only Writes to files without breaking data integrity is difficult Managing the relations among multiple files is also challenging 16

A collection of organized data Database management system (DBMS) Interface between user and data Store/manage/analyze data Relational databases Based on the relational model (schema) NoSQL Databases Unstructured/semi-structured Redis, HBase, MongoDB, Neo4J 17

Advantages Logical and physical data independence Concurrency control and transaction support Query the data easily (e.g. SQL)... Disadvantages Cost (computational resources, fixed schema) Maintenance and management Complex and time-consuming to design schema... 18

Compare flat files to databases Think about: What are the advantages and disadvantages of using flat files or databases? In what situations would you use a flat file or a database? How to design your own database? How to load, index and query data in a database? 19

Analyze Yelp s Academic Dataset https://www.yelp.com/dataset_challenge Answer questions in runner.sh Use tools such as awk, grep, pandas Similar to what you did in projects 1.1, 1.2 Merge TSV files by joining on a common field Identify the disadvantages of flat files 20

Prepare tables A script to create the table and load the data is already provided Use MySQL queries to answer questions Learn JDBC Complete MySQLTasks.java Aggregate functions, joins Statement and PreparedStatement SQL injection Learn how to use proper indexes to improve performance 21

Schema The structure of the tables and the relations between tables Based on the structure of the data Index: A data structure that helps speed up the retrieval of data from tables (B-Tree, Hash indexes, etc) Based on the data as well as queries You can build effective indexes only if you are aware of the queries you need. We have an insightful section about the practice of indexing, read it carefully! 22

How do we evaluate the performance of a query? Run it. What if we want/need to predict the performance without execution? Use EXPLAIN statements. An EXPLAIN statement on a query will predict: the number of rows to scan whether it makes use of indexes or not etc. 23

ORM is a technique to abstract away the work for you to: 1. Map the domain class with the database table 2. Map each field of the domain class with a column of the table 3. Map instances of the classes (objects) with rows in the corresponding tables 24

Separation of concerns. ORM decouples the CRUD operations and the business logic code. Productivity. You don t need to keep switching between your OOP language such as Java/Python, etc and SQL. Flexibility to meet evolving business requirements. ORM cannot eliminate the schema update problem, but it may ease the difficulty, esp. when used together with data migration tools. Persistence transparency. Any changes to a persistent object will be automatically propagated to the database without explicit SQL queries. Vendor independence. ORM can abstract the application away from the underlying SQL database and SQL dialect. 25

The current business application exposes an API that returns the most popular Pittsburgh businesses. It is based on a SQLite3 database with an outdated schema. Your task is to plug the business application to the MySQL database and update the definition of the domain class to match the new schema. The API will be backward compatible without modifying any business logic code. 26

CAP The CAP theorem: it is impossible for a distributed data store to provide all the following three guarantees at the same time Consistency: no stale data Availability: no downtime Partition Tolerance: network failure tolerance in a distributed system CA: non-distributed (MySQL, PostgreSQL) CP: downtime (HBase, MongoDB) AP: stale data (Amazon DynamoDB) 27

Non-SQL(Non-relational) or NotOnly-SQL Why NoSQL if we already have SQL solutions? Traditional SQL Databases do not always align with the needs of distributed systems. Flexible Data Model Basic Types of NoSQL Databases Schema-less Key-Value Stores (Redis) Wide Column Stores (Column Family Stores) (HBase) Document Stores (MongoDB) Graph DBMS (Neo4j) 28

Key value store is a type of NoSQL database e.g. Redis and Memcached Widely used as an in-memory cache Your task is to implement a simplified version of Redis We provide you with starter code Redis.java with operations unimplemented, you will implement two of the commonly used data structures supported by Redis: hashes and lists 29

Launch an EMR cluster with HBase installed. Follow the write-up to download and load the data into HBase. Try different querying commands in the HBase shell. Complete HBaseTasks.java using HBase Java APIs. 30

Tag your resources with: Key: Project, Value: 3.1 Stop the submitter instance to take a break and work on it later. Terminate the EMR cluster. Make sure to terminate the instance after finishing all questions and submitting your answers. 31

twitter DATA ANALYTICS: TEAM PROJECT

Team Project 33

Team Project Phase 1: Q1 Q2 (MySQL AND HBase) Input your team account ID and GitHub username on TPZ Phase 2 Q1 Q2 & Q3 (MySQL AND HBase) Phase 3 Q1 Q2 & Q3 & Q4 (MySQL OR HBase) 34

Team Project System Architecture Web server architectures Dealing with large scale real world tweet data HBase and MySQL optimization 35

Team Project Deadlines Writeup and queries were released on Monday, Feb 26th, 2018. Phase 1 milestones: Checkpoint 1: Report, due on Sunday, 3/11 Checkpoint 2: Q1 on scoreboard, due on Sunday, 3/25 Phase 1 Deadline: Q2 on scoreboard, due on Sunday, 4/1 Phase 1, code and report: due on Tuesday, 4/3 36

Git workflow Commit your code to the private repo we set up Update your GitHub username in TPZ! Make changes on a new branch Work on this branch, commit as you wish Open a pull request to merge into the master branch Code review Someone else needs to review and accept (or reject) your code changes This process will allow you to capture bugs and remain informed on what others are doing 37

Query 1, QR code Query 1 does not require a database Implement encoding and decoding of QR code A simplified version of QR You must explore different web frameworks Get at least 2 different web frameworks working Select the framework with the better performance Provide evidence of your experimentation 38

Query 1 Specification Structural components Two sizes of QR, depending on the message length Position detection patterns Alignment patterns Timing patterns Fill in a message Interleave chars & error detection codes Put in the payload, move in an S shape

Query 1 Example Encoding For a message, return the image encoded as a hex string The area in the red box in the image below Decoding Find the encoded image in a random background The image may be rotated 0, 90,180, or 270 degrees Use patterns to find the QR code! Then do the inverse of encoding Sorry, you can t scan it with a camera yet :(

Query 2, Hashtag Recommendation Query 2 is a DB read query Given a sentence, return hashtags that appeared most frequently with words in it You have to perform Extract, Transform and Load (ETL) Can be done on AWS, Azure or GCP Dataset is available on all three cloud platforms Pay close attention to the ETL rules and related definitions Make good use of the reference data & server In this query, you will need both front end + back end MySQL and HBase You need two 10-minute submissions, (Q2M & Q2H) by the deadline 41

Query 2 Example Assume that the dataset contains user 123456789: Cloud is great #aws user 987654321: Cloud computing is tough #azure Sample request and response Request: keywords=cloud,computing&userid=123456789&n=3 #aws: 2 points; #azure: 2 points Response: #aws,#azure Return top n hashtags (or all), break ties lexicographically 42

Phase (and query due) Start Deadline Code and Report Due 43

44

45