CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Similar documents
NOSQL DATABASE SYSTEMS: DECISION GUIDANCE AND TRENDS. Big Data Technologies: NoSQL DBMS (Decision Guidance) - SoSe

Anti-Caching: A New Approach to Database Management System Architecture. Guide: Helly Patel ( ) Dr. Sunnie Chung Kush Patel ( )

Big Data Hadoop Stack

Shark: Hive (SQL) on Spark

April Copyright 2013 Cloudera Inc. All rights reserved.

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Oracle NoSQL Database and Cisco- Collaboration that produces results. 1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

VOLTDB + HP VERTICA. page

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Architecture of a Real-Time Operational DBMS

Understanding NoSQL Database Implementations

Data Storage Infrastructure at Facebook

BigFUN: A Performance Study of Big Data Management System Functionality

VoltDB vs. Redis Benchmark

MySQL Cluster Web Scalability, % Availability. Andrew

DATABASE SCALE WITHOUT LIMITS ON AWS

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Safe Harbor Statement

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Achieving Horizontal Scalability. Alain Houf Sales Engineer

CSE 344 JULY 9 TH NOSQL

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Perspectives on NoSQL

E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

CompSci 516 Database Systems

Chapter 24 NOSQL Databases and Big Data Storage Systems

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

HyperDex. A Distributed, Searchable Key-Value Store. Robert Escriva. Department of Computer Science Cornell University

Presented by Nanditha Thinderu

A Fast and High Throughput SQL Query System for Big Data

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

Stages of Data Processing

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

Big Data Analytics. Rasoul Karimi

CSE6331: Cloud Computing

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

Implementing SQL Server 2016 with Microsoft Storage Spaces Direct on Dell EMC PowerEdge R730xd

Big Data Architect.

Shen PingCAP 2017

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

A Glimpse of the Hadoop Echosystem

Big Data with Hadoop Ecosystem

Albis: High-Performance File Format for Big Data Systems

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

DATABASE DESIGN II - 1DL400

Warehouse- Scale Computing and the BDAS Stack

TPC-E testing of Microsoft SQL Server 2016 on Dell EMC PowerEdge R830 Server and Dell EMC SC9000 Storage

DiskReduce: Making Room for More Data on DISCs. Wittawat Tantisiriroj

NoSQL Performance Test

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

SCHISM: A WORKLOAD-DRIVEN APPROACH TO DATABASE REPLICATION AND PARTITIONING

Overview. * Some History. * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL. * NoSQL Taxonomy. *TowardsNewSQL

Intro to MongoDB. Alex Sharp.

NewSQL Databases. The reference Big Data stack

Couchbase Architecture Couchbase Inc. 1

Course Content MongoDB

Lecture 7 (03/12, 03/14): Hive and Impala Decisions, Operations & Information Technologies Robert H. Smith School of Business Spring, 2018

CISC 7610 Lecture 2b The beginnings of NoSQL

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Next-Generation Cloud Platform

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

Benchmarking Cloud Serving Systems with YCSB 詹剑锋 2012 年 6 月 27 日

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

Microsoft Big Data and Hadoop

MySQL & NoSQL: The Best of Both Worlds

Condusiv s V-locity Server Boosts Performance of SQL Server 2012 by 55%

a linear algebra approach to olap

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

Class Overview. Two Classes of Database Applications. NoSQL Motivation. RDBMS Review: Client-Server. RDBMS Review: Serverless

MapR Enterprise Hadoop

NVMe SSDs Future-proof Apache Cassandra

Importing and Exporting Data Between Hadoop and MySQL

Introduction to NoSQL Databases

VoltDB for Financial Services Technical Overview

CAST(HASHBYTES('SHA2_256',(dbo.MULTI_HASH_FNC( tblname', schemaname'))) AS VARBINARY(32));

MySQL High Availability

Appliances and DW Architecture. John O Brien President and Executive Architect Zukeran Technologies 1

NewSQL Databases MemSQL and VoltDB Experimental Evaluation

Scaling for Humongous amounts of data with MongoDB

Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

CSE 344 Final Review. August 16 th

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

Micron and Hortonworks Power Advanced Big Data Solutions

Hadoop Development Introduction

Transcription:

Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576)

Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL server and MongoDB Evaluation Traditional DSS Workload: Hive vs PDW Modern OLTP Workload: MongoDB vs SQL Server Discussion Conclusion

Traditional DBMS are under attack from two spectrum: Document store NoSQL MongoDB NoSQL data analytic systems Hive on Hadoop Compare the performance and scalability of SQL system to the NoSQL system on: Interactive data serving applications which are characterized by the OLTP workloads? SQL Server vs MongoDB using the YCSB benchmark Decision support analysis on massive amount of Data OLAP PDW vs Hive using the TPC-H DSS benchmark YCSB (Yahoo! Cloud Serving Benchmark): A framework for evaluating the performance of different key-value serving store It can compare between several workloads (eg: 70% reads and 30% writes) TPC-H: A decision support benchmark It executes queries with a high degree of complexity Evaluates the performance of various support systems by execution of sets of queries against standard database

Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL server and MongoDB

Shared nothing parallel database system built on top of SQL Server Multiple compute nodes, a single control node and other administrative service nodes. Each compute node is a separate server running SQL Server The control node is responsible for handling the user query and generating an optimized plan of parallel operations

An open-source data warehouse built on top of Hadoop. Originally build by Facebook, used to analyze and query 15TB of log data each day. Provides a structured data model for data that is stored in the Hadoop Distributed Filesystem (HDFS), and a SQL-like query language- HiveQL for writing concise queries on data in hive tables Compiles the queries into efficiently chained map-reduce jobs Tables can be partitioned (range partition e.g. by date) and also bucketed (hash partitions - may improve queries)

Scalable, high-performance, open source, schema-free, document oriented database. Support indexing in the form of B-trees and replications of data between servers Documents are stored in BSON (binary JSON) Javascript-based query syntax RDBMS Tables Records/Rows Queries return Record(s) MongoDB Collections Documents/Objects Queries return a cursor

MongoDB-AS: Auto Sharding uses a form of range partitioning Server-CS and MongoDB-CS Implemented by author Use client-side sharding to determine the home node/shard both use hash partitioning no automatic load balancing

Cluster of 16 nodes connected by 1 GB HP Procurve 2510G 48 (J9280A) Ethernet switch Each node has dual Intel Xeon L5630 quad-core processors running at 2.13 GHz, 32 GB of main memory, and 10 SAS 10K RPM 300GB hard drives 1 Hard drive reserved for the OS 8 disk to store the data OS : Hive Ubuntu 11.04 Other Windows Server 2008 R2

TCP-H benchmark used for testing Use TCP-H at 4 scale factors: 250GB, 1000GB, 4000GB, 16000GB Each scale factor represent cases where different portions of the TPC-H tables fit in main memory Executed 22 TCP-H queries

Data Layout Hive tables Contain both partitions and buckets: In this case the table consists of a set of directories (one for each partition) Each directory contains a set of files, each one corresponding to one bucket only partitions: Can improve selections queries but not join only buckets: Can help improve the performance of joins but cannot improve selections queries PDW : Can be horizontally partitioned: The records are distributed to the partitions using a hash function on a partition column Can be replicated across all the nodes

Data Preparation and Load Times Hive : Generated dataset across 16 nodes Create one hive table for each TCP-H table Data is loaded in two phase: Data files loaded onto each node Data is converted from text to RCFile format PDW Load data into landed node Create necessary tables

Performance Analysis : PDW is faster than Hive in for all TPC-H queries The average speedup of PDW over Hive is greater for small datasets Hive has high overheads for small datasets. Scalability Analysis: Hive scales better than PDW Hive scales well as the dataset size increase

Discussion: PDW outperforming Hive for the following reason Hive s most efficient storage format (RCFile) is CPU-bound during scans. Cost-based optimization in PDW results in better join ordering. Hive doesn t exploit partitioning attributes as well as PDW does. Hive scales better than PDW for the following reason It has extra overheads for small datasets For some queries, increasing the dataset size does not affect the query time, since there is enough available parallelism to process the data Some tasks take the same amount of time at all scale factors

YCSB data-serving benchmark used for testing Workload Description: Extends YCSB into 2 ways : added support for multiple instances on many database servers added support for stored procedures in the YCSB JDBC driver Ran the YCSB benchmark on a database that consists of 640 million records (80M records per node). The dataset size per node is approximately 2.5 larger than the available main memory at each server machine.

Workload Description (cont..) : Five workloads Each read request reads all the record fields Each update request updates only one field. Each scan request reads at most 1,000 records from the database. Each append request inserts a new record in the database whose key has the next greater value than that of the last inserted key. Each workload is run for 30 minutes. The values of latency and throughput reported are the average values over the last 10 minutes of execution, measured every 10 second interval.

Discussion: Performance of NoSQL still lags behind RDBMS SQL-CS was able to achieve higher throughput than the MongoDB for the same number of clients And had lower latency across for almost every single test of the YCSB benchmark. Interestingly, this is the case even when the NoSQL system did not provide any form of durability.

Results find that the relational systems continue to provide a significant performance advantage over their NoSQL counterparts, but the NoSQL alternatives can incorporate the techniques used by RDBMSs to improve their performance. SQL systems and NoSQL systems have different focuses on non-performance related features. (e.g. support for auto-shading and automatic load balancing and different data models). It is likely that in the future these systems will start to compete on the functionality aspects.