Apache Accumulo 1.4 & 1.5 Features

Inaugural Accumulo DC Meetup
Keith Turner

What is Accumulo?
A re-implementation of the BigTable design: a distributed database that does not support SQL. Data is arranged by row, column, and time, and there is no need to declare columns or types.

Massively Distributed

Key Structure
Key component       Role
Row                 Primary partitioning component
Column Family       Secondary partitioning component
Column Qualifier    Uniqueness control
Column Visibility   Access control
Time Stamp          Versioning control
Value               The value itself
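To make the roles concrete, here is a minimal write-side sketch using the standard Accumulo client classes; the row, column names, visibility label, and value are made up for illustration.

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.ColumnVisibility;
    import org.apache.hadoop.io.Text;

    public class KeyStructureExample {
        // Builds one key/value pair; each argument maps to a key component above.
        static Mutation exampleMutation() {
            Mutation m = new Mutation(new Text("row_0001"));      // Row: primary partitioning component
            m.put(new Text("attr"),                               // Column Family: secondary partitioning
                  new Text("height"),                             // Column Qualifier: uniqueness control
                  new ColumnVisibility("public"),                 // Column Visibility: access control
                  1310491465052L,                                 // Time Stamp: versioning control
                  new Value("6ft".getBytes()));                   // Value
            return m;                                             // hand this to a BatchWriter to apply it
        }
    }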

Tablet Organization

System Processes

Tablet Data Flow

RFile Multi-level Index

1.3 RFile Index
The index is an array of keys, one key per data block. The complete index is loaded into memory when a file is opened and kept in memory while a file is written, causing slow load times and memory exhaustion.

1.3 RFile Index (diagram): a single index block referencing Data Block, Data Block, Data Block, ...

Non multi-level 9G file
Locality group      : <DEFAULT>
  Start block       : 0
  Num blocks        : 182,691
  Index level 0 size: 8,038,404 bytes
  First key         : 00000008611ae5d6 0e94:74e1 [] 1310491465052 false
  Last key          : 7ffffff7daec0b4e 43ee:1c1e [] 1310491475396 false
  Num entries       : 172,014,468
  Column families   : <UNKNOWN>
Meta block          : BCFile.index
  Raw size          : 2,148,491 bytes
  Compressed size   : 1,485,697 bytes
  Compression type  : lzo
Meta block          : RFile.index
  Raw size          : 8,038,470 bytes
  Compressed size   : 5,664,704 bytes
  Compression type  : lzo

1.4 RFile Index (diagram): an Index Level 1 block referencing Index Level 0 blocks, which in turn reference the Data Blocks.

File layout
Data, Data, Index L0, Data, Data, Index L0, Data, Data, Index L0, Index L1. Index blocks are written as data is written, so the complete index is not kept in memory during the write; only the current index block for each level is kept in memory.

Multi-level 9G file
Locality group      : <DEFAULT>
  Start block       : 0
  Num blocks        : 187,070
  Index level 1     : 4,573 bytes, 1 block
  Index level 0     : 10,431,126 bytes, 82 blocks
  First key         : 00000008611ae5d6 0e94:74e1 [] 1310491465052 false
  Last key          : 7ffffff7daec0b4e 43ee:1c1e [] 1310491475396 false
  Num entries       : 172,014,468
  Column families   : <UNKNOWN>
Meta block          : BCFile.index
  Raw size          : 5 bytes
  Compressed size   : 17 bytes
  Compression type  : lzo
Meta block          : RFile.index
  Raw size          : 4,652 bytes
  Compressed size   : 3,745 bytes
  Compression type  : lzo

Test Setup
Files were placed in the FS cache before each test (cat test.rf > /dev/null). There was too much variability when some of a file was in the FS cache and some was not, and it is easier to force all of a file into the cache than out of it.

Open, seek 9G file
Multilevel   Index Cache   Avg Time
N            N             139.48 ms
N            Y              37.55 ms
Y            N               8.48 ms
Y            Y               3.90 ms

Randomly seeking 9G file
Multilevel   Index Cache   Avg Time
N            N             2.08 ms
N            Y             2.32 ms
Y            N             4.33 ms
Y            Y             2.14 ms

Configuration
The index block size is configurable per table; make the index block size large and the file behaves like the old RFile. The index cache is enabled by default in 1.4.
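For example, assuming a table named mytable and 1.4-era property names (table.file.compress.blocksize.index and table.cache.index.enable are assumptions here), the per-table settings could be adjusted through the client API roughly like this:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;

    public class IndexConfigExample {
        static void configure() throws Exception {
            // Connection details are placeholders.
            Connector conn = new ZooKeeperInstance("accumulo", "zkhost:2181")
                .getConnector("root", "secret".getBytes());
            // A very large index block size makes the file behave like the old single-level RFile.
            conn.tableOperations().setProperty("mytable",
                "table.file.compress.blocksize.index", "512K");
            // The index cache is already on by default in 1.4; shown only for completeness.
            conn.tableOperations().setProperty("mytable",
                "table.cache.index.enable", "true");
        }
    }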

Cache hit rate on monitor page

Fault tolerant concurrent table operations

Problematic situations
The master dies during create table; create and delete table run concurrently; a client dies during bulk ingest; a table is deleted while being written to or scanned.

Create table fault
If the master dies during create table, it could leave the Accumulo metadata in a bad state. If the master creates the table but dies before notifying the client, a retrying client gets a table-exists exception.

FATE: Fault Tolerant Executor
If the process dies, previously submitted operations continue to execute on restart. Each operation is serialized in ZooKeeper before execution. The master uses FATE to execute table operations.

Bulk import test
Create a table with a summation aggregator, then bulk import files of +1 or -1 values; the ingest graph ensures the final sum is 0. Concurrently merge, split, compact, and consistency-check the table. The exit point verifies the sum is 0.

Adampotent
Idempotent: f(f(x)) = f(x)
Adampotent: f(f'(x)) = f(x), where f'(x) denotes partial execution of f(x)

REPO: Repeatable Persisted Operation

    public interface Repo<T> extends Serializable {
        long isReady(long tid, T environment) throws Exception;
        Repo<T> call(long tid, T environment) throws Exception;
        void undo(long tid, T environment) throws Exception;
    }

call() returns the next op, or null when done. call() and undo() must be adampotent. undo() should clean up any partial execution of isReady() or call().
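As an illustration of the contract, a step might look like the following. This class is hypothetical, not taken from the Accumulo source, and Env is a stand-in for the real environment type passed to the ops.

    // Hypothetical FATE step, for illustration only.
    interface Env {
        void mkdir(String path);    // creating an existing directory is a no-op
        void delete(String path);
    }

    class CreateDirStep implements Repo<Env> {
        private final String tableId;

        CreateDirStep(String tableId) { this.tableId = tableId; }

        public long isReady(long tid, Env env) { return 0; }      // 0 = ready to run now

        public Repo<Env> call(long tid, Env env) throws Exception {
            env.mkdir("/tables/" + tableId);                      // adampotent: repeating it is harmless
            return null;                                          // null = no further steps
        }

        public void undo(long tid, Env env) throws Exception {
            env.delete("/tables/" + tableId);                     // clean up anything call() left behind
        }
    }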

FATE interface

    long startTransaction();
    void seedTransaction(long tid, Repo op);
    TStatus waitForCompletion(long tid);
    Exception getException(long tid);
    void delete(long tid);

Client calling FATE on master
Start Tx, Seed Tx, Wait Tx, Delete Tx (sketched below).
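A sketch of that sequence, assuming the interface above is exposed as a Fate type and the status enum is named TStatus, and reusing the hypothetical CreateDirStep from earlier:

    static void createTableDir(Fate fate) throws Exception {
        long tid = fate.startTransaction();                  // Start Tx: allocate a new transaction id
        fate.seedTransaction(tid, new CreateDirStep("t1"));  // Seed Tx: persist the first op before running it
        TStatus status = fate.waitForCompletion(tid);        // Wait Tx: block until SUCCESSFUL or FAILED
        if (status == TStatus.FAILED)
            throw fate.getException(tid);                    // surface the saved failure to the caller
        fate.delete(tid);                                    // Delete Tx: remove the finished transaction
    }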

FATE transaction states: NEW, IN PROGRESS, FAILED IN PROGRESS, SUCCESSFUL, FAILED

FATE persistent store

    public interface TStore {
        long reserve();
        Repo top(long tid);
        void push(long tid, Repo repo);
        void pop(long tid);
        TStatus getStatus(long tid);
        void setStatus(long tid, TStatus status);
        ...
    }

Seeding

    void seedTransaction(long tid, Repo repo) {
        if (store.getStatus(tid) == NEW) {
            if (store.top(tid) == null)
                store.push(tid, repo);
            store.setStatus(tid, IN_PROGRESS);
        }
    }

FATE transaction execution
Reserve the tx, then execute the top op: if it returns another op, push it and repeat; once there is no next op, mark success. On exception: save the exception, mark the transaction failed in progress, then repeatedly undo the top op and pop it, and finally mark the transaction failed. A sketch of this loop follows.
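Stated as code, the loop might look roughly like this. It is a sketch against the Repo and TStore interfaces above, not the actual Accumulo implementation; the status constant names are assumed from the states slide, and the real executor obtains the tid with store.reserve().

    static <T> void runTransaction(long tid, TStore store, T env) throws Exception {
        try {
            for (Repo<T> op = (Repo<T>) store.top(tid); op != null; op = (Repo<T>) store.top(tid)) {
                Repo<T> next = op.call(tid, env);              // execute top op
                if (next == null)
                    break;                                     // no next op: the operation is complete
                store.push(tid, next);                         // push op: persist the next step first
            }
            store.setStatus(tid, TStatus.SUCCESSFUL);          // mark success
        } catch (Exception e) {
            // save exception (elided here), mark failed in progress, then roll back the stack
            store.setStatus(tid, TStatus.FAILED_IN_PROGRESS);
            for (Repo<T> op = (Repo<T>) store.top(tid); op != null; op = (Repo<T>) store.top(tid)) {
                op.undo(tid, env);                             // undo top op
                store.pop(tid);                                // pop op
            }
            store.setStatus(tid, TStatus.FAILED);              // mark fail
        }
    }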

Concurrent table ops
Per-table read/write lock in ZooKeeper, using the standard ZooKeeper lock recipe with slight modifications: the FATE tid is stored as the lock data, and persistent sequential nodes are used instead of ephemeral sequential nodes (see the sketch below).
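A rough sketch of requesting the read lock with the modified recipe; the paths and the tid are made up, and only the node creation is shown.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class TableLockExample {
        static String requestReadLock(ZooKeeper zk, String tableId, long fateTid) throws Exception {
            // Persistent (not ephemeral) sequential node, so the lock request survives a
            // master restart; the FATE tid stored as data identifies which transaction owns it.
            return zk.create("/accumulo/table-locks/" + tableId + "/read-",
                Long.toHexString(fateTid).getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT_SEQUENTIAL);
            // As in the standard recipe, the request with the lowest sequence number that
            // is not blocked by an earlier writer holds the lock.
        }
    }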

Concurrent delete table
What happens when a table is deleted during a read or write? In 1.3, the scan most likely hangs forever; rarely, a permission exception is thrown (seen in test). In 1.4, the client checks whether the table still exists when it finds nothing in !METADATA or sees a permission exception.

Rogue threads
The bulk import FATE op pushes work to tablet servers, and threads on a tserver can still be executing after the FATE op is done. To finish cleanly, the master deletes the work ID (WID) in ZooKeeper and waits until the counts reach 0. The tablet server worker thread, in synchronized blocks: on start, checks that the WID still exists and increments cnt[wid], exiting if it does not; does the work; then decrements cnt[wid].

Create Table Ops
1. CreateTable: allocates the table id
2. SetupPermissions
3. PopulateZookeeper: reentrantly locks the table in isReady(); relates the table name to the table id
4. CreateDir
5. PopulateMetadata
6. FinishCreateTable

Random walk concurrent test
Perform all table operations on 5 well-known tables, with multiple test clients running concurrently, and ensure that neither Accumulo nor the test clients crash or hang. The test is too chaotic to verify data correctness, but it found many bugs.

Future Work
Allow multiple processes to run FATE operations; stop polling; Hulk Smash tolerant.

Other 1.4 Features

Tablet Merging
Splits exist forever, but data is often aged off. Merging is admin requested, not automatic, and can be done by range or by size (see the sketch below).
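For instance, merging a range of aged-off tablets back together from the Java client API; the table name and row range are hypothetical, and size-based merging is also available from the shell.

    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    public class MergeExample {
        static void mergeAgedOffRange(Connector conn) throws Exception {
            // Merge every tablet whose rows fall between "2010" and "2011" back into one tablet.
            conn.tableOperations().merge("logs", new Text("2010"), new Text("2011"));
        }
    }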

Merging Minor Compaction
1.3: minor compactions always add new files. 1.4: limits the total number of new files.

Merging Minor Compactions
Slows down minor compactions, so memory fills up, which creates back-pressure on ingest and prevents the "unable to open enough files to scan" failure.

Table Cloning
Creates a new table using the same read-only files, so it is fast. Useful for testing at scale with representative data: a compaction with a broken iterator means no more data, so experiment on the clone instead. Taking a clone offline and copying it creates a consistent snapshot in minutes. A sketch follows.
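A sketch of cloning a production table for testing; the table names are made up, and no per-table properties are overridden or excluded here.

    import java.util.Collections;
    import org.apache.accumulo.core.client.Connector;

    public class CloneExample {
        static void cloneForTesting(Connector conn) throws Exception {
            // flush=true writes in-memory data to files before the clone, so the clone
            // sees everything; the clone shares the source's read-only files.
            conn.tableOperations().clone("prod_index", "test_index", true,
                Collections.<String, String>emptyMap(), Collections.<String>emptySet());
        }
    }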

Range Operations
Compact Range: compact all tablets that fall within a row range down to a single file; useful for tail-insert indexes and for data purge.
Delete Range: delete all content within a row range; uses splits and will drop whole tablets for efficiency; useful for data purge.
A sketch of both operations follows.
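Both operations are exposed on the client's TableOperations; a sketch with hypothetical table and row names:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    public class RangeOpsExample {
        static void compactAndPurge(Connector conn) throws Exception {
            // Compact Range: only the tablets covering rows between "2011" and "2012";
            // flush first and wait for the compaction to finish.
            conn.tableOperations().compact("index", new Text("2011"), new Text("2012"), true, true);
            // Delete Range: drop old content between "2009" and "2010"; whole tablets
            // inside the range are removed without rewriting data.
            conn.tableOperations().deleteRows("index", new Text("2009"), new Text("2010"));
        }
    }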

Logical Time for Bulk Import
Bulk files created on different machines end up with actual times that differ by hours, since bulk files always contain the timestamps created by the client. A single bulk import request can now set a consistent time across all of its files, implemented using iterators (sketched below).
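This corresponds to the setTime flag on the bulk import call; a sketch with made-up table and directory names:

    import org.apache.accumulo.core.client.Connector;

    public class BulkImportExample {
        static void importWithLogicalTime(Connector conn) throws Exception {
            // The final argument asks the tablet servers to stamp every imported key with
            // one consistent logical time instead of the clients' differing wall clocks.
            conn.tableOperations().importDirectory("logs", "/bulk/files", "/bulk/failures", true);
        }
    }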

Roadmap

1.4.1 Improvements
Snappy and LZ4 compression; HDFS Kerberos compatibility; bloom filter improvements; MapReduce directly over Accumulo files; server-side row select iterator; bulk ingest support for map-only jobs; BigTop support; Wikisearch example improvements.

Possible 1.5 features

Performance
Multiple namenode support; WAL performance improvements; in-memory map locality groups; timestamp visibility filter optimization; prefix encoding; distributed FATE ops.

API
Stats API; automatic deployment of iterators; tuplestream support; coprocessor integration; support for other client languages; client session migration.

Admin/Security
Kerberos support / pluggable authentication; administration monitor page; data center replication; rollback / snapshot backup and recovery.

Reliability
Decentralize master operations; tablet server state model.

Testing
Mini-cluster test harness; continuous random walk testing.