YCSB++ benchmarking tool: Performance debugging advanced features of scalable table stores


YCSB++ benchmarking tool: Performance debugging advanced features of scalable table stores
Swapnil Patil, M. Polte, W. Tantisiriroj, K. Ren, L. Xiao, J. Lopez, G. Gibson, A. Fuchs*, B. Rinaldi*
Carnegie Mellon University, *National Security Agency

Importance of scalable table stores
- For data processing and analysis
- For systems services (e.g., metadata in Colossus)
http://www.pdl.cmu.edu/

Growing complexity of table stores
Growing set of HBase features by release year (2008-2011+): RangeRowFilters, batch updates, bulk load tools, RegEx filtering, scan optimizations, co-processors, access control
- Simple, lightweight stores have evolved into complex, feature-rich stores
- Supports a broader range of applications
- Hard to debug performance issues and complex component interactions

State of table store benchmarking
- YCSB: Yahoo! Cloud Serving Benchmark [Cooper2010]
- Modular design to test different table stores
- Great for CRUD (create-read-update-delete) benchmarking, but not for sophisticated features
- Need richer tools for understanding advanced features in table stores

This talk: YCSB++ tool
NEW EXTENSIONS IN YCSB++
- Distributed, coordinated and multi-phase testing
- Fine-grained, correlated monitoring using Otus [Ren2011]
TABLE STORE FEATURES TESTED BY YCSB++
- Batch writing
- Table pre-splitting
- Bulk loading
- Weak consistency
- Server-side filtering
- Fine-grained security
Tool released at http://www.pdl.cmu.edu/ycsb++

Talk Outline
- Motivation
- YCSB++ architecture
- Illustrative examples of using YCSB++
- Summary and ongoing work

Original YCSB framework
[Architecture diagram: workload parameters feed a workload executor (threads, stats), whose API adaptor talks to HBase or other DBs running on storage servers]
- Configurable workload generation to test stores
- API adaptor converts read(k) to hbase_get(k)
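The adaptor idea on this slide can be sketched as follows. This is a hypothetical Python illustration of the pattern only (YCSB's real adaptors are Java classes implementing its DB interface); every class and method name here is made up.

```python
# Illustrative sketch of the API-adaptor pattern: the workload executor
# issues generic CRUD calls, and a per-store adaptor translates them into
# store-specific operations. All names here are hypothetical.

class TableStoreAdaptor:
    """Generic CRUD interface the workload executor programs against."""
    def read(self, key): raise NotImplementedError
    def update(self, key, value): raise NotImplementedError

class InMemoryStoreAdaptor(TableStoreAdaptor):
    """Stand-in for an HBase/Accumulo adaptor: maps the generic read(k)
    to the store's native get, and update(k, v) to its native put."""
    def __init__(self):
        self._store = {}              # stand-in for the remote table store
    def read(self, key):
        return self._store.get(key)   # e.g., hbase_get(k) in an HBase adaptor
    def update(self, key, value):
        self._store[key] = value      # e.g., hbase_put(k, v)

def run_workload(adaptor, ops):
    """Minimal workload executor: replays (op, key, value) tuples."""
    results = []
    for op, key, value in ops:
        if op == "update":
            adaptor.update(key, value)
        elif op == "read":
            results.append(adaptor.read(key))
    return results

db = InMemoryStoreAdaptor()
out = run_workload(db, [("update", "k1", "v1"), ("read", "k1", None)])
```

Because the executor only sees the generic interface, adding a new store (as YCSB++ did for Accumulo) means writing one new adaptor, not touching the workload logic.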

YCSB++ supports new table stores
[Diagram: extensions added to the workload parameters and workload executor; the API adaptor now also targets Accumulo]
- New DB adaptor for the Apache Accumulo table store
- New parameters and workload executor extensions

Coordinated & multi-phase tests
[Diagram: multi-phase extensions in the workload executor, plus ZooKeeper-based coordination across YCSB clients]
- ZooKeeper-based coordination & synchronization
- Enables heavy workloads and asymmetric testing

Coordinated & multi-phase tests
Distributed, multi-client tests using YCSB++:
- Allow clients to coordinate their test actions
- Rely on shared data structures in ZooKeeper
- Useful for testing weak data consistency
Multi-phase tests in YCSB++:
- Can construct tests comprising different phases
- Built on ZooKeeper-based barrier synchronization
- Used for understanding high-speed ingest features
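The barrier-synchronized phase structure described above can be sketched as follows. YCSB++ builds its barriers on ZooKeeper znodes; since that needs a live ZooKeeper ensemble, this self-contained sketch uses threading.Barrier as a stand-in to show the coordination property (no client enters phase i+1 until every client finishes phase i). Phase names and client count are illustrative.

```python
# Sketch of multi-phase coordination: all clients must reach a barrier
# before any client may start the next phase. threading.Barrier stands in
# for the ZooKeeper-based barrier used by YCSB++.
import threading

N_CLIENTS = 4
PHASES = ["pre-load", "measure", "load", "measure"]   # illustrative phases
barrier = threading.Barrier(N_CLIENTS)
log = []
log_lock = threading.Lock()

def client(cid):
    for phase in PHASES:
        # ... this client's share of `phase` work would run here ...
        with log_lock:
            log.append((phase, cid))
        barrier.wait()   # block until all N_CLIENTS have finished this phase

threads = [threading.Thread(target=client, args=(i,)) for i in range(N_CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The barrier guarantees the log is grouped by phase: every client's entry
# for phase i appears before any client's entry for phase i+1.
phases_seen = [p for p, _ in log]
```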

Collective monitoring in YCSB++
[Diagram: Otus monitoring added alongside the multi-phase and coordination extensions, spanning YCSB clients and storage servers]
- Fine-grained resource monitoring using Otus [Ren2011]
- Collects from YCSB, table stores, HDFS and /proc

Talk Outline
- Motivation
- YCSB++ architecture
- Illustrative examples of using YCSB++
  - Case study: HBase and Accumulo, both Bigtable-like table stores
- Summary and ongoing work

Primer on Bigtable-like stores
[Diagram: data insertion flows into a memtable (backed by a write-ahead log) on tablet servers T1..TN; sorted, indexed store files live on HDFS nodes]
(1) Incoming mutations are logged and buffered in a memtable (unsorted order)
(2) MINOR COMPACTION: memtables are written to sorted, indexed store files in HDFS
(3) MAJOR COMPACTION: LSM-tree based merging of store files into fewer files (in background)
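The three-step write path above can be condensed into a minimal sketch. Real stores persist to HDFS and recover from the write-ahead log; here "store files" are just sorted in-memory lists, so this only illustrates the memtable/minor-compaction/major-compaction structure, not durability.

```python
# Minimal sketch of the Bigtable-style write path:
# (1) mutations buffer in an in-memory memtable,
# (2) a minor compaction flushes the memtable to a sorted store file,
# (3) a major compaction merges store files into one.

class MiniLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # (1) unsorted in-memory buffer
        self.store_files = []         # sorted, immutable runs (oldest first)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        # (2) flush the memtable as one sorted store file
        self.store_files.append(sorted(self.memtable.items()))
        self.memtable = {}

    def major_compaction(self):
        # (3) merge all store files; newer files win on duplicate keys
        merged = {}
        for run in self.store_files:              # oldest first
            merged.update(run)
        self.store_files = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.store_files):    # newest file first
            for k, v in run:
                if k == key:
                    return v
        return None

db = MiniLSM(memtable_limit=2)
for i in range(5):
    db.put(f"k{i}", i)
db.put("k0", 99)          # overwrite an old key; newer version must win
db.major_compaction()     # collapses three store files into one
```

The reads-scan-many-files behavior in get() is exactly why the later slides show read latency climbing as store files accumulate and dropping after compactions run.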

Accumulo table store
- Started at NSA; now an Apache project
- Built for high-speed ingest and scan workloads
- http://incubator.apache.org/projects/accumulo.html
New features in Accumulo:
- Iterator framework for user-specified programs placed at different stages of the DB pipeline (e.g., supports joins and stream processing)
- Also provides fine-grained cell-level access control

Before I talk about examples...
YCSB++ provides:
- Abstractions to construct distributed, parallel tests
- Built-in tests that use these abstractions
- Monitoring that collects and correlates system (store/fs/os) state with observed performance
YCSB++ does not provide:
- Root-cause diagnosis of performance problems; it merely points you to where you should look

FEATURES TESTED BY YCSB++ -- ILLUSTRATIVE EXAMPLE: Table bulk loading
(Other features: batch writing, weak consistency, table pre-splitting, server-side filtering, access control)
Table bulk loading:
- High-speed ingestion through minimal data migration
- Needs careful tuning and configuration [Sasha2002]

Table bulk loading in action
(1) FORMAT existing data files into the store-file-specific format (e.g., HFiles) using a Hadoop tool on the HDFS cluster
(2) IMPORT the store files into the table store's tablet servers to make the data available to users
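The two-step flow above can be sketched as follows. In the real systems the FORMAT step is a MapReduce job producing HFiles and the IMPORT step is a metadata-only handoff to tablet servers; this sketch compresses both into plain Python, and the tablet/split representation is a hypothetical simplification.

```python
# Sketch of the two-step bulk load: (1) FORMAT raw records into sorted,
# range-partitioned "store files", then (2) IMPORT each file into the
# tablet that owns its key range -- no per-row inserts through the
# normal write path, which is what makes bulk loading fast.
import bisect

def format_step(records, split_points):
    """Partition records by tablet key range and sort each partition,
    mimicking the sorted store-file format the servers expect."""
    files = [[] for _ in range(len(split_points) + 1)]
    for key, value in records:
        # tablet i owns keys in (split_points[i-1], split_points[i]]
        files[bisect.bisect_left(split_points, key)].append((key, value))
    return [sorted(f) for f in files]

def import_step(tablets, files):
    """Register each formatted file with its tablet -- in real stores a
    metadata-only operation: no row data is rewritten."""
    for tablet, f in zip(tablets, files):
        tablet.append(f)

records = [("m", 1), ("a", 2), ("z", 3), ("c", 4)]
splits = ["g", "t"]                  # 3 tablets: keys <= g, (g, t], > t
tablets = [[], [], []]               # each tablet holds a list of files
import_step(tablets, format_step(records, splits))
```

Because the files arrive pre-sorted and pre-partitioned, the servers skip the memtable and write-ahead log entirely, but as the following slides show, the post-load rebalancing (splits and compactions) still costs read latency.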

8-phase bulk load test in YCSB++
Phase sequence: Pf, Pi, M, Lf, Li, M, S, M
- Pre-load (Pf, Pi): insert 6M rows into an empty table
- Load (Lf, Li): load 48M rows into the existing table
- Measurement phases (M): a light mix of read/update operations, interleaved to study performance over time
- Sleep (S): let servers finish balancing work

Multi-phase tests show variation
[Plot: Accumulo read latency (ms, log scale) over each of the three measurement phases (0-300 s) of the Pf-Pi-M-Lf-Li-M-S-M test]
- 10x latency variation that lasts for a long time!
- Uniformly low latency after the store is steady (no inserts)

Monitoring rebalancing at servers
[Plot: StoreFiles, Tablets and Compactions counts (log scale) over the running time of the 8-phase test (0-1500 sec)]
Let's take a closer look at correlating performance with server-side state

Effect of server-side work on latency
[Plots: Accumulo read latency (ms, log scale) correlated with server-side StoreFiles, Tablets and Compactions over the experiment running time]
- StoreFiles and Tablets increase with splitting
- Background compactions reduce the number of store files

YCSB++ helps study different policies
[Plots: Accumulo vs. HBase read latency (ms, log scale) across the three measurement phases (R/U 1 in phase 3, R/U 2 in phase 6, R/U 3 in phase 8), correlated with each store's StoreFiles, Tablets and Compactions over the full experiment (0-1800 sec)]

FEATURES TESTED BY YCSB++ -- ILLUSTRATIVE EXAMPLE: Batch writing
(Other features: table bulk loading, weak consistency, table pre-splitting, server-side filtering, access control)
Batching writes at clients:
- Improves insert throughput and latency
- Newly inserted data may not be immediately visible to others

Batching improves throughput
[Bar chart: inserts per second (1000s) for HBase and Accumulo at batch sizes of 10 KB, 100 KB, 1 MB and 10 MB]
- 6 clients create 9 million 1-KB records on 6 servers
- Small batches: high client CPU utilization limits work
- Large batches: servers are saturated, limiting further benefit
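Client-side batching as described above can be sketched as follows: updates accumulate in a buffer and are sent in one call once the buffer reaches its size limit, trading immediate visibility for fewer round trips. The byte accounting and flush policy here are simplified assumptions, not the actual HBase or Accumulo client logic.

```python
# Sketch of client-side write batching: buffer puts until the configured
# batch size is reached, then ship the whole batch in one "RPC".

class BatchingWriter:
    def __init__(self, server, batch_bytes):
        self.server = server          # dict standing in for the table store
        self.batch_bytes = batch_bytes
        self.buffer = []
        self.buffered_bytes = 0
        self.rpcs = 0                 # number of server round trips

    def put(self, key, value):
        self.buffer.append((key, value))
        self.buffered_bytes += len(key) + len(value)
        if self.buffered_bytes >= self.batch_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.server.update(self.buffer)   # one call for the whole batch
            self.rpcs += 1
            self.buffer, self.buffered_bytes = [], 0

server = {}
w = BatchingWriter(server, batch_bytes=40)
for i in range(10):
    w.put(f"key{i:02d}", "0123456789")        # 15 bytes buffered per record
w.flush()                                      # drain the tail of the buffer
```

With a 40-byte limit and 15-byte records, ten puts cost only four round trips instead of ten; a larger batch_bytes cuts RPCs further but widens the window during which other readers cannot yet see the buffered writes, which is exactly the consistency effect the next slides measure.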

Weak consistency test in YCSB++
[Diagram: Client #1 inserts {K:V} (10^6 records) through a batched store client and (1) enqueues a 1% sample of keys K into ZooKeeper; Client #2 (2, 3) polls and dequeues K and (4) reads {K} from the table store servers]
- ZooKeeper-based multi-client coordination
- Clients use a shared producer-consumer queue to communicate the keys to be tested

Test setup details
- YCSB++ tests 1% of the keys inserted by C1
- C1 inserts 1 million keys; C2 reads 10K keys
- Sampling avoids overloading ZooKeeper
- R-W lag for key K = time required by C2 to read K successfully; if C2 can't read K on the first attempt, it retries
- Report the time lag for the fraction of keys that need multiple read()s
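The measurement loop above can be sketched in miniature: a producer inserts keys through a delayed (batched) write path and publishes them on a shared queue; a consumer retries each read until the key becomes visible and records how long that took. A local queue.Queue stands in for the ZooKeeper producer-consumer queue, and the write delay is simulated with a timer; all constants are illustrative.

```python
# Sketch of the read-after-write lag measurement from the weak
# consistency test: retry reads of each sampled key until visible,
# recording the elapsed time as that key's R-W lag.
import queue
import threading
import time

store = {}                      # stand-in for the table store
key_queue = queue.Queue()       # stand-in for the ZooKeeper queue
WRITE_DELAY = 0.05              # simulated batching delay (seconds)

def producer(keys):
    for k in keys:
        # delayed visibility: the write "lands" WRITE_DELAY after insert,
        # as if it were sitting in a client-side write batch
        threading.Timer(WRITE_DELAY, store.__setitem__, (k, "v")).start()
        key_queue.put((k, time.monotonic()))
    key_queue.put(None)         # end-of-stream marker

def consumer():
    lags = {}
    while (item := key_queue.get()) is not None:
        k, t_insert = item
        while k not in store:   # first read failed: retry until visible
            time.sleep(0.005)
        lags[k] = time.monotonic() - t_insert
    return lags

threading.Thread(target=producer, args=(["a", "b", "c"],)).start()
lags = consumer()
```

Plotting the distribution of these per-key lags across many keys gives exactly the read-after-write CDFs shown on the next slide.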

Batch writing causes time lag
[CDF plots: fraction of requests vs. read-after-write time lag (sec) for different client buffer sizes. (a) HBase: 10 KB (<1% of keys need multiple reads), 100 KB (7.4%), 1 MB (17%), 10 MB (23%). (b) Accumulo: 10 KB (<1%), 100 KB (1.2%), 1 MB (14%), 10 MB (33%)]
- Delayed writes may not be visible for ~100 seconds
- Batching implementations affect latency; YCSB++ helps understand the differences

FEATURES TESTED BY YCSB++ -- OTHER DETAILS
(Features: batch writing, weak consistency, table bulk loading, table pre-splitting, server-side filtering, access control)
- ACM SOCC 2011 paper available
- Poster session(s)

Evolving YCSB++
Future work:
- Study additional table stores (Cassandra, MongoDB and CouchDB)
- Test more features: iterators and co-processors
Understanding table store features:
- Cost-benefit tradeoff of different heuristics for compactions on tablet servers
- Dynamo-style eventual consistency

Summary: YCSB++ tool
For performance debugging & benchmarking of advanced features using extensions to YCSB:
- Weak consistency semantics
- Fast insertion (pre-splits, bulk loads)
- Server-side filtering
- Fine-grained access control
- Distributed clients using ZooKeeper
- Multi-phase testing (with Hadoop)
- New workload generators and database client API extensions
Two case studies: HBase & Accumulo
Download at http://www.pdl.cmu.edu/ycsb++