A Non-Relational Storage Analysis

A Non-Relational Storage Analysis: Cassandra & Couchbase
Alexandre Fonseca, Anh Thu Vu, Peter Grman
Cloud Computing, 2nd semester 2012/2013
Universitat Politècnica de Catalunya

Microblogging - big data?
- Twitter: over 500 million registered users.
- 340 million tweets per day.
- More than 12 TB of data per day in 2010.
- Hard to provide this service relying solely on a small number of centralized servers.
- Scaling up has its limits.

Cassandra
- Tested version: 1.2.4 (April 2013).
- Data model: hybrid between key-value and tabular storage.
  - Data split into column families (like tables in a relational database system, RDS).
  - Column families split into rows (indexed by row keys).
  - Dynamic columns (schema).
- Interface: CQL via Thrift.
- Our setup:
  - 256 virtual nodes per physical node.
  - Cluster given topology awareness using the EC2 snitch.
  - Consistency level ONE for reads and writes.
  - Replication factor of 3 per datacenter.
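The consistency implications of this setup can be checked with the usual replica-overlap rule: a read is only guaranteed to see the latest acknowledged write if the number of replicas read plus the number of replicas written exceeds the replication factor. The following is a minimal sketch of that arithmetic for a single datacenter; the function name `overlaps` is hypothetical and is not part of any Cassandra API.

```python
def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """Return True if every read set must intersect every write set."""
    return read_replicas + write_replicas > replication_factor


RF = 3                # replication factor used in the evaluation
ONE = 1               # consistency level ONE touches a single replica
QUORUM = RF // 2 + 1  # majority of replicas: 2 when RF is 3

print(overlaps(ONE, ONE, RF))        # False
print(overlaps(QUORUM, QUORUM, RF))  # True
```

With ONE for both reads and writes, a read may therefore hit a replica that has not yet received the write, which is why the deck measures convergence time separately.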

Couchbase
- Tested version: 2.0.1 (April 2013).
- Data model: key (string) - value (JSON).
- View/index: incremental MapReduce.
- Automatic replication:
  - Configurable replication factor for each bucket, with a maximum of 3.
  - Set to 3 in single-region experiments.
  - Set to 2 in multi-region experiments.
- Consistency:
  - Strong consistency for access by key.
  - Eventual consistency for queries on views.

PyDLoader
- Custom Python script with 2 components:
  - Manager:
    - Interactive console.
    - Deploys new nodes.
    - Installs and controls slaves on those nodes.
  - Slaves:
    - Automatic setup, populating and takedown of database clusters (database slaves).
    - Generation of application workload towards database clusters (workload slaves).
- Libraries used: Boto (AWS), Paramiko (SSH), RPyC (RPC).

Evaluation
- Done using Amazon Web Services (AWS).
- 6 database nodes (m1.small):
  - 1.7 GB of RAM
  - 1 compute unit
  - 150 GB of storage
- 12 workload nodes (t1.micro):
  - 615 MB of RAM
  - 1 compute unit
  - No local storage (NAS)
- Distributed equally over 2 datacenters in Ireland.

Evaluation
- Focus on 3 main operations:
  - Tweet
  - Userline (all tweets by a user)
  - Timeline (all tweets by the people a user is following)
- Limit of 50 items per userline/timeline.
- Basic data structure: aggregate data according to operations, not entities:
  - Timeline table/bucket: UserID, TweetID, PostedBy, Body
  - Userline table/bucket: UserID, TweetID, Body
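To make the operation-oriented layout concrete, here is a minimal in-memory sketch of the three operations. It is an illustration under assumptions, not the authors' schema or either database's API: plain Python dictionaries stand in for the timeline and userline tables/buckets, and the names `follow`, `tweet`, `get_userline` and `get_timeline` are hypothetical.

```python
from collections import defaultdict, deque
from itertools import count

LIMIT = 50  # items kept per userline/timeline, as in the evaluation

followers = defaultdict(set)                         # user -> users following them
userline = defaultdict(lambda: deque(maxlen=LIMIT))  # user -> (tweet_id, body)
timeline = defaultdict(lambda: deque(maxlen=LIMIT))  # user -> (tweet_id, posted_by, body)
_next_id = count(1)


def follow(follower, followee):
    followers[followee].add(follower)


def tweet(user, body):
    """Write the tweet once per reader: the author's userline plus every follower's timeline."""
    tweet_id = next(_next_id)
    userline[user].appendleft((tweet_id, body))
    for f in followers[user]:
        timeline[f].appendleft((tweet_id, user, body))
    return tweet_id


def get_userline(user):
    """Newest-first tweets by the user; a single lookup, no join."""
    return list(userline[user])


def get_timeline(user):
    """Newest-first tweets by the people the user follows; a single lookup, no join."""
    return list(timeline[user])


follow("bob", "alice")
tweet("alice", "hello")
print(get_timeline("bob"))  # [(1, 'alice', 'hello')]
```

Reads become single-key lookups, while the cost of a tweet grows with the author's number of followers, since the body is copied into each follower's timeline.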

Ease of Setup
- Cassandra - Awesome!
  - Install, configure addresses, partitioner, snitch, replication factors and seed nodes, launch!
  - Automatic partitioning and replication.
  - Automatic adjustment to churn.
- Couchbase - Equivalently Awesome!
  - Install, configure RAM, directory, launch!
  - Easily add/failover/remove servers.
  - Rebalance: background process, asynchronous, incremental.
  - Automatic replication, partitioning and node failure detection.

Setup time
- Cassandra:
  - 6 nodes form and stabilize the ring after ~2 minutes.
  - Populating with sample data: ~1 minute in the 1-node configuration, 6.5 minutes in the 6-node configuration.
- Couchbase:
  - 6 nodes form and stabilize the cluster after ~2 minutes.
  - Populating with sample data: ~1 minute in the 1-node configuration, ~2 minutes in the 6-node configuration.

Latency
[Charts: tweet latency, userline latency, timeline latency]

Normalized latency
[Charts: tweet latency, userline latency, timeline latency]

Tweets & Denormalization
[Chart: tweet latency vs. number of followers]
- Immediate denormalization is not scalable.
- How to make it asynchronous?
  - Cassandra: no native support; requires external processing.
  - Couchbase: views!
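One way to picture the "external processing" option is a deferred fan-out: the write path stores the tweet once and queues a job, and a separate step later copies it into followers' timelines. The sketch below is only an illustration under assumptions. It uses in-memory structures rather than either database, the names `tweet`, `drain` and `pending` are hypothetical, and it does not represent how Couchbase views work.

```python
from collections import defaultdict, deque

followers = defaultdict(set)   # user -> users following them
userline = defaultdict(list)   # user -> tweet bodies
timeline = defaultdict(list)   # user -> (posted_by, body)
pending = deque()              # fan-out jobs awaiting the background step


def tweet(user, body):
    """Write path independent of follower count: store once, defer the fan-out."""
    userline[user].append(body)
    pending.append((user, body))


def drain(max_jobs=None):
    """Background step: copy queued tweets into followers' timelines."""
    done = 0
    while pending and (max_jobs is None or done < max_jobs):
        user, body = pending.popleft()
        for f in followers[user]:
            timeline[f].append((user, body))
        done += 1
    return done
```

The trade-off is that a tweet only appears on timelines after the next `drain` run, so the visible delay depends on how often that step is scheduled.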

Reconfiguration Latency: Cassandra
- Adding a new node to a cluster: ~5 minutes.
- [Chart: latency while a node joins the cluster]

Reconfiguration Latency: Couchbase
- Adding a new node to a cluster: immediate, BUT the rebalance takes ~30 minutes.
- [Chart: latency while a node joins the cluster]

Consistency/Convergence
- Cassandra:
  - Average of 0.096498 seconds to detect a new tweet, with a standard deviation of 0.096319.
  - Same datacenter => very fast detection.
  - Different datacenter => slower detection.
- Couchbase:
  - Average of 0.007501 seconds to detect a new tweet, with a standard deviation of 0.012476.
  - The delay for a new tweet to appear on the timeline is proportional to the schedule period.
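The deck does not describe how detection time was measured. A plausible approach, sketched here purely as an assumption, is to issue a write through one client and poll through another until the write becomes visible, then summarize the delays. The helpers `detection_delay` and `summarize` are hypothetical, and `write` and `is_visible` stand for whatever client calls the test harness would supply.

```python
import statistics
import time


def detection_delay(write, is_visible, poll_interval=0.001, timeout=5.0):
    """Seconds between issuing a write and first observing it via is_visible()."""
    start = time.perf_counter()
    write()
    while not is_visible():
        if time.perf_counter() - start > timeout:
            raise TimeoutError("write never became visible")
        time.sleep(poll_interval)
    return time.perf_counter() - start


def summarize(delays):
    """Mean and sample standard deviation of a list of delays (needs >= 2 samples)."""
    return statistics.mean(delays), statistics.stdev(delays)
```

Whether the reported figures use the sample or the population standard deviation is not stated; `statistics.pstdev` would give the latter.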

Replication: Cassandra
- Very flexible in terms of replication configuration.
- Per-datacenter replication factors with no hard-coded limitations.
- [Chart: behaviour under a crash of 2 nodes at second 100]

Replication: Couchbase
- Automatic and configurable per bucket, with a limit of 1 active copy plus 3 replicas.
- [Chart: behaviour under crashes of 2 nodes at seconds 30 and 100]

Load Balancing
- Cassandra:
  - Average data ownership per node: 16.68%.
  - Standard deviation: 1.23%.
- Couchbase: evenly distributed, with a standard deviation of:
  - 0.65% for data on disk.
  - 0.04% for RAM usage.
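For context, a perfectly even split across 6 nodes would give each node 100/6 ≈ 16.67% of the data. The spread around that value in a token ring with virtual nodes can be explored with a small simulation. The sketch below is an illustration under assumptions: tokens are derived from MD5 as a stand-in for Cassandra's partitioner, and the names `token` and `ownership` are hypothetical.

```python
import hashlib
import statistics

RING = 2 ** 64  # size of the simulated token space


def token(label: str) -> int:
    """Map a label to a position on the ring (MD5 used only as a stand-in hash)."""
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")


def ownership(nodes, vnodes=256):
    """Fraction of the token ring owned by each node, given vnodes tokens per node."""
    ring = sorted((token(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    owned = dict.fromkeys(nodes, 0)
    prev = ring[-1][0] - RING  # wrap around: the first range starts at the last token
    for tok, node in ring:
        owned[node] += tok - prev  # each token owns the range (prev, tok]
        prev = tok
    return {n: span / RING for n, span in owned.items()}


shares = ownership([f"node{i}" for i in range(6)])
print(f"mean  = {100 * statistics.mean(shares.values()):.2f}%")
print(f"stdev = {100 * statistics.stdev(shares.values()):.2f}%")
```

Raising `vnodes` generally tightens the spread, which is the motivation for using many virtual nodes per physical node.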

System Load
[Charts: CPU idle time, free memory, disk I/O]

Disk Space Usage
- SQLite database: 1.6 MB.
- Cassandra:
  - Fully denormalized (single node): 16.31 MB.
  - Fully denormalized (6 nodes): 9.5 MB/node, including partitioning and replication.
  - "Normalized" (no body in userline/timeline): 2.8 MB on a single node, 1.6 MB/node on 6 nodes.
  - Commit log after populating: 54 MB.
  - Needs 50% of disk space free at all times for column family compactions and data redistributions.
- Couchbase:
  - Total of 250 MB distributed across 6 nodes.
  - No minimum free space requirement.
- Lack of disk usage limitations in both DBs: not ideal for volunteer computing systems.

Multi Region Setup
Nodes distributed across Ireland and N. Virginia.

                        Single region               Multiple regions
                        Cassandra     Couchbase     Cassandra     Couchbase
Setup + populate time   8.5 min       4 min         8.7 min       12 min
Avg. tweet              0.0405 sec    0.0696 sec    0.3043 sec    0.0681 sec
Avg. userline           0.0093 sec    0.0208 sec    0.0477 sec    0.0169 sec
Avg. timeline           0.0101 sec    0.3639 sec    0.0836 sec    0.9791 sec
Consistency μ (sec)     0.096498      0.007501      0.609560      2.929887
Consistency σ (sec)     0.096319      0.012476      1.080330      3.384052
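The effect of going multi-region is easier to read as a ratio of multi-region to single-region latency. The short sketch below simply divides the average latencies reported on this slide; the dictionary layout and the `slowdown` helper are illustrative choices, not part of the original analysis.

```python
# Average latencies in seconds as (single region, multiple regions),
# copied from the figures reported on this slide.
latency = {
    "tweet":    {"cassandra": (0.0405, 0.3043), "couchbase": (0.0696, 0.0681)},
    "userline": {"cassandra": (0.0093, 0.0477), "couchbase": (0.0208, 0.0169)},
    "timeline": {"cassandra": (0.0101, 0.0836), "couchbase": (0.3639, 0.9791)},
}


def slowdown(single, multi):
    """Multi-region latency as a multiple of single-region latency."""
    return multi / single


ratios = {
    op: {db: slowdown(*pair) for db, pair in dbs.items()}
    for op, dbs in latency.items()
}

for op, dbs in ratios.items():
    for db, ratio in dbs.items():
        print(f"{op:9s} {db:10s} x{ratio:.2f}")
```

The two Couchbase columns are not strictly like-for-like, since the replication factor was 3 in the single-region experiments and 2 in the multi-region ones.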

Conclusion
- Easy cluster setup.
- Allows horizontal scaling over multiple nodes.
- Improved performance through denormalization.
- Higher storage requirements.
- The future is hybrid:
  - A mix of RDS and NoSQL systems.
  - Cassandra's CQL, Couchbase's Views, NoSQL in RDS.