1
2
3
6
7
8
9
10
Storage & IO Benchmarking Primer Running sysbench and preparing data Use the prepare option to generate the data. Experiments Run sysbench with different storage systems and instance types. Doing this multiple times to reveal different behaviors and results. Compare the requests per second. 11
Scenario Instance Type Storage Type RPS Range RPS Increase Across 3 Iterations 1 t2.micro EBS Magnetic Storage 169.13, 171.81, 181.31 Trivial (< 5%) 2 t2.micro EBS General Purpose SSD 1465.00, 1473.33, 1490.93 Trivial (< 5%) 3 m4.large EBS Magnetic Storage 527.70, 973.63, 1246.67 Significant (can reach ~140% increase with an absolute value of 450-700) 4 m4.large EBS General Purpose SSD 2046.66, 2612.00, 2649.66 Noticeable (can reach ~30% increase with an absolute value of 500-600) What can you conclude from these results? 12
SSD has better performance than magnetic disk m4.large instance has better performance than a t2.micro instance The RPS increase across 3 iterations for m4.large is more significant than that for t2.micro: The reason is an instance with more memory can cache more of the previous requests for repeated tests. Caching is a vital performance tuning mechanism when building high performance applications. 13
P3.1: Files, SQL, and NoSQL: Task 1: analyze data in flat files Linux tools (e.g. grep, awk) Data libraries (e.g. pandas) Task 2: Explore a SQL database (MySQL) load data, run queries, indexing, auditing plain-sql v/s ORM Task 3: Implement a prototype of Redis Task 4: Explore a NoSQL DB (HBase) load data, run basic queries The NoSQL and HBase primers are vital for P3.1 14
Flat files, plain text or binary comma-separated values (CSV) format: Carnegie,Cloud Computing,A,2018 tab-separated values (TSV) format: Carnegie\tCloud Computing\tA\t2018 a custom and verbose format: Name: Carnegie, Course: Cloud Computing, Section: A, Year: 2018 15
Lightweight, Flexible, in favor of small tasks Run it once and throw it away Performing complicated analysis on data in files can be inconvenient Usually flat files should be fixed or append-only Writes to files without breaking data integrity is difficult Managing the relations among multiple files is also challenging 16
A collection of organized data Database management system (DBMS) Interface between user and data Store/manage/analyze data Relational databases Based on the relational model (schema) NoSQL Databases Unstructured/semi-structured Redis, HBase, MongoDB, Neo4J 17
Advantages Logical and physical data independence Concurrency control and transaction support Query the data easily (e.g. SQL)... Disadvantages Cost (computational resources, fixed schema) Maintenance and management Complex and time-consuming to design schema... 18
Compare flat files to databases Think about: What are the advantages and disadvantages of using flat files or databases? In what situations would you use a flat file or a database? How to design your own database? How to load, index and query data in a database? 19
Analyze Yelp s Academic Dataset https://www.yelp.com/dataset_challenge Answer questions in runner.sh Use tools such as awk, grep, pandas Similar to what you did in projects 1.1, 1.2 Merge TSV files by joining on a common field Identify the disadvantages of flat files 20
Prepare tables A script to create the table and load the data is already provided Use MySQL queries to answer questions Learn JDBC Complete MySQLTasks.java Aggregate functions, joins Statement and PreparedStatement SQL injection Learn how to use proper indexes to improve performance 21
Schema The structure of the tables and the relations between tables Based on the structure of the data Index: A data structure that helps speed up the retrieval of data from tables (B-Tree, Hash indexes, etc) Based on the data as well as queries You can build effective indexes only if you are aware of the queries you need. We have an insightful section about the practice of indexing, read it carefully! 22
How do we evaluate the performance of a query? Run it. What if we want/need to predict the performance without execution? Use EXPLAIN statements. An EXPLAIN statement on a query will predict: the number of rows to scan whether it makes use of indexes or not etc. 23
ORM is a technique to abstract away the work for you to: 1. Map the domain class with the database table 2. Map each field of the domain class with a column of the table 3. Map instances of the classes (objects) with rows in the corresponding tables 24
Separation of concerns. ORM decouples the CRUD operations and the business logic code. Productivity. You don t need to keep switching between your OOP language such as Java/Python, etc and SQL. Flexibility to meet evolving business requirements. ORM cannot eliminate the schema update problem, but it may ease the difficulty, esp. when used together with data migration tools. Persistence transparency. Any changes to a persistent object will be automatically propagated to the database without explicit SQL queries. Vendor independence. ORM can abstract the application away from the underlying SQL database and SQL dialect. 25
The current business application exposes an API that returns the most popular Pittsburgh businesses. It is based on a SQLite3 database with an outdated schema. Your task is to plug the business application to the MySQL database and update the definition of the domain class to match the new schema. The API will be backward compatible without modifying any business logic code. 26
CAP The CAP theorem: it is impossible for a distributed data store to provide all the following three guarantees at the same time Consistency: no stale data Availability: no downtime Partition Tolerance: network failure tolerance in a distributed system CA: non-distributed (MySQL, PostgreSQL) CP: downtime (HBase, MongoDB) AP: stale data (Amazon DynamoDB) 27
Non-SQL(Non-relational) or NotOnly-SQL Why NoSQL if we already have SQL solutions? Traditional SQL Databases do not always align with the needs of distributed systems. Flexible Data Model Basic Types of NoSQL Databases Schema-less Key-Value Stores (Redis) Wide Column Stores (Column Family Stores) (HBase) Document Stores (MongoDB) Graph DBMS (Neo4j) 28
Key value store is a type of NoSQL database e.g. Redis and Memcached Widely used as an in-memory cache Your task is to implement a simplified version of Redis We provide you with starter code Redis.java with operations unimplemented, you will implement two of the commonly used data structures supported by Redis: hashes and lists 29
Launch an EMR cluster with HBase installed. Follow the write-up to download and load the data into HBase. Try different querying commands in the HBase shell. Complete HBaseTasks.java using HBase Java APIs. 30
Tag your resources with: Key: Project, Value: 3.1 Stop the submitter instance to take a break and work on it later. Terminate the EMR cluster. Make sure to terminate the instance after finishing all questions and submitting your answers. 31
twitter DATA ANALYTICS: TEAM PROJECT
Team Project 33
Team Project Phase 1: Q1 Q2 (MySQL AND HBase) Input your team account ID and GitHub username on TPZ Phase 2 Q1 Q2 & Q3 (MySQL AND HBase) Phase 3 Q1 Q2 & Q3 & Q4 (MySQL OR HBase) 34
Team Project System Architecture Web server architectures Dealing with large scale real world tweet data HBase and MySQL optimization 35
Team Project Deadlines Writeup and queries were released on Monday, Feb 26th, 2018. Phase 1 milestones: Checkpoint 1: Report, due on Sunday, 3/11 Checkpoint 2: Q1 on scoreboard, due on Sunday, 3/25 Phase 1 Deadline: Q2 on scoreboard, due on Sunday, 4/1 Phase 1, code and report: due on Tuesday, 4/3 36
Git workflow Commit your code to the private repo we set up Update your GitHub username in TPZ! Make changes on a new branch Work on this branch, commit as you wish Open a pull request to merge into the master branch Code review Someone else needs to review and accept (or reject) your code changes This process will allow you to capture bugs and remain informed on what others are doing 37
Query 1, QR code Query 1 does not require a database Implement encoding and decoding of QR code A simplified version of QR You must explore different web frameworks Get at least 2 different web frameworks working Select the framework with the better performance Provide evidence of your experimentation 38
Query 1 Specification Structural components Two sizes of QR, depending on the message length Position detection patterns Alignment patterns Timing patterns Fill in a message Interleave chars & error detection codes Put in the payload, move in an S shape
Query 1 Example Encoding For a message, return the image encoded as a hex string The area in the red box in the image below Decoding Find the encoded image in a random background The image may be rotated 0, 90,180, or 270 degrees Use patterns to find the QR code! Then do the inverse of encoding Sorry, you can t scan it with a camera yet :(
Query 2, Hashtag Recommendation Query 2 is a DB read query Given a sentence, return hashtags that appeared most frequently with words in it You have to perform Extract, Transform and Load (ETL) Can be done on AWS, Azure or GCP Dataset is available on all three cloud platforms Pay close attention to the ETL rules and related definitions Make good use of the reference data & server In this query, you will need both front end + back end MySQL and HBase You need two 10-minute submissions, (Q2M & Q2H) by the deadline 41
Query 2 Example Assume that the dataset contains user 123456789: Cloud is great #aws user 987654321: Cloud computing is tough #azure Sample request and response Request: keywords=cloud,computing&userid=123456789&n=3 #aws: 2 points; #azure: 2 points Response: #aws,#azure Return top n hashtags (or all), break ties lexicographically 42
Phase (and query due) Start Deadline Code and Report Due 43
44
45