Scalable Table Stores: Tools for Understanding Advanced Key-Value Systems for Hadoop
1 Scalable Table Stores: Tools for Understanding Advanced Key-Value Systems for Hadoop Garth Gibson Professor, Carnegie Mellon Univ., & CTO, Panasas Inc. with Julio Lopez, Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao, CMU to appear in SoCC October 2011 with Wittawat Tantisiriroj, Swapnil Patil, CMU Seung Woo Son, Sam Lang, Rob Ross, Argonne Nat Lab to appear in SC11 November 2011
2 The Future is Data-Led
[Chart: BLEU score of NIST Arabic-English competition entries (Google, ISI, IBM+CMU, UMD, JHU+CU, Edinburgh, Systran, Mitre, FSC), with quality thresholds from useless through topic identification, usable translation, and human-editable translation up to expert human translator]
NIST competition: translate 100 articles Arabic-English
2005 outcome: Google wins! Qualitatively better on 1st entry
Brute force statistics with more data & compute!! M words from UN translations, 1 billion words of English grammar, a 1000-processor cluster
IEEE Intelligent Systems, March/April
3 Science of Many Types is Data-Led
Contact | Field | Comments
J Lopez, CSD | Astrophysics | SDSS digital sky survey including spectroscopy, 50TB
T Di Matteo, Physics | Astrophysics | Bigben BHCosmo hydrodynamics (1B particles simulated), 30TB
F Gilman, Physics | Astrophysics | Large Synoptic Survey Telescope, LSST (2012) digital sky survey, 15TB/day
C Langmead, CSD | Biology | X-ray, NMR, CryoEM images; simulated molecular dynamics trajectories
J Bielak, CE | Earth sciences | USGS sensor images; simulated 4D earthquake wavefields, >10TB/run
D Brumley, ECE | Cyber security | Worldwide Malware Archive; 2TB, doubling each year
O Mutlu, ECE | Genomics | 50GB per compressed genome sequencing; expands to TBs to process
B Yu, ECE | Neuroscience | Neural recordings (electrodes, optical) for prosthetics; GBs each
J Callan, LTI | Info Retrieval | ClueWeb09, 25TB, 1B high-rank web pages, 10 languages
T Mitchell, MLD | Machine Learning | English sentences of ClueWeb for continuous automated reading (5TB)
M Herbert, RI | Image Understanding | Flickr archive (>4TB); broadcast TV archive; street video; soldier video
Y Sheikh, RI | Virtual Reality | Terascale VR sensor, 1000 cameras + 200 microphones, up to 5TB/sec
C Guestrin, CSD | Machine Learning | Blog update archives, 2TB now + 2.7TB/yr (about 500K blogs/day)
C Faloutsos, CSD | Data Mining | Wikipedia change archive (1TB), fly embryo images (1.5TB), links from Yahoo web
S Vogel, LTI | Machine Translation | Pre-filtered N-gram language model based on statistics on word alignment, 100TB
J Baker, LTI | Machine Translation | Spoken language recording archive, many languages, many sources, up to 1PB
B Becker, RI | Computer Vision | Social network image/video archive for training computer vision systems, 1-5TB
4 CMU PDL History of Scalable Storage
1995: DARPA funds Network-Attached Secure Disks (NASD)
NASD spin-offs:
Object Storage Device standardized by T10/SCSI 2004, 2009
Panasas parallel storage system, Gibson co-founder & CTO; primary storage on first petascale computer, LANL Roadrunner; also NIH, Citadel, ING, BNP, BP, ConocoPhillips, PetroChina, StatOil, Ferrari, BMW, 3M, Lockheed Martin, Northrop Grumman, Sandia, NASA
Lustre Linux open source parallel file system
With Panasas, Lustre & PVFS, 3/4 of top500.org are object-based
Graduates go to storage, server & internet companies, e.g. Google File System (2003) & BigTable (2006) cloud database
Parallel NFS achieves IETF RFC in 2010, spurred on by Panasas; Linux adoption in , 3.0 and 3.1 (2011)
5 For the Experience, Operate A Cloud
Two clusters: 3 TF, 2.2 TB, 142 nodes, 1.1K cores, ½ PB
Available to CMU escience users as a Hadoop queue: IR & ML classes, ML research, comp bio research, astro research, geo research, malware analysis, social network analysis, systems research
[Cluster diagram: CMU OpenCloud logical racks of 32 worker nodes plus 6-7 RAID-protected storage nodes on 48-port 10GE switches (38-39 10GE SFP+ twinax links, 6x 10GE trunks); CMU OpenCirrus logical racks of 19-20 worker nodes on 1GE-down/10GE-up switches with 2x 10GE trunks; an external 24-port 10GE switch via 2x 10GE SR optical links, and a 10 Gbps LR optical link to NLR and other OpenCloud sites]
6 To Understand: Cloud FS vs. Parallel FS
Hadoop's storage library, HDFS, is replaceable
Replace with PVFS, a user-level parallel FS, to understand differences
Buf: prefetching (HDFS: write once; add deep prefetch)
Map: layout (stripe unit::node; optimized launch)
Rep: replicate data (no HW RAID!)
To be published in SC11, Nov
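The "add deep prefetch" bullet can be illustrated with a toy client-side readahead buffer: on a miss, fetch a large aligned window so sequential MapReduce-style scans hit the buffer. This is a hypothetical sketch, not the actual PVFS shim.

```python
class PrefetchReader:
    """Deep-prefetch sketch: on a buffer miss, read a whole aligned window
    from the underlying store so later sequential reads are served locally."""

    def __init__(self, read_fn, window=4 * 1024 * 1024):
        self.read_fn = read_fn      # read_fn(offset, length) -> bytes
        self.window = window
        self.buf, self.buf_off = b"", 0
        self.fetches = 0            # underlying reads issued, for illustration

    def read(self, offset, length):
        end = offset + length
        # Miss: requested range not fully inside the buffered window.
        if not (self.buf_off <= offset and end <= self.buf_off + len(self.buf)):
            self.buf_off = (offset // self.window) * self.window
            self.buf = self.read_fn(self.buf_off, self.window)
            self.fetches += 1
        start = offset - self.buf_off
        return self.buf[start:start + length]
```

With a 4 MB window, a scan of 64 KB records issues one underlying read per 4 MB instead of one per record.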
7 Replication inside a PVFS file
PVFS, like most cluster/parallel file systems, assumes RAID HW
HDFS, like GoogleFS, does not like the scaling of RAID HW
Teach the PVFS client to internally replicate (hybrid approximating HDFS)
Code is not production quality: the error path is too hard for academics :-)
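The client-side replication idea can be sketched as follows, with plain dicts standing in for PVFS servers. This is a toy model, not the real code; as the slide notes, the hard part is exactly the error path, which this sketch only gestures at with a crude read failover.

```python
def write_replicated(chunk, targets, n=3):
    """Write one chunk to n replica targets (HDFS-style triplication,
    but done by the client). Returns the stores written to."""
    placed = []
    for store in targets[:n]:
        store[chunk["id"]] = chunk["data"]
        placed.append(store)
    return placed

def read_replicated(chunk_id, targets):
    """Read from the first replica that still has the chunk."""
    for store in targets:
        if chunk_id in store:
            return store[chunk_id]
    raise IOError("all replicas failed for %s" % chunk_id)
```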
8 Interesting Implementation Issues
HDFS: performance disk-bound by chunk creation
PVFS: insufficient parallelism in a single stream
9 Differences Not Visible in Apps
OpenCloud apps: astrophysics, social network analysis
Hadoop helps: job scheduler does load balancing; dataset is a directory of files
10 Scalable Table Stores
Inspired by Google's BigTable
Reported to SCALE: >76 PB in one database, >10M operations/sec
B-tree with giant nodes
Data model is dynamic: lots of columns, strings everywhere
Writeback of mutations written as sorted, indexed log files
Read-misses search all logs: log-structured merge trees
Layered on GFS (HDFS)
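A minimal sketch of the write-back and read-miss behavior described above: mutations buffer in a memtable and flush as sorted, indexed runs, and a read miss searches the runs newest-first. Nothing here is BigTable's actual code; it only illustrates the log-structured merge idea.

```python
import bisect

class TinyLSM:
    """Toy log-structured merge store: puts go to a memtable; full
    memtables flush as immutable sorted runs; a read miss in the
    memtable searches all runs, newest first."""

    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        items = sorted(self.memtable.items())
        self.runs.append(([k for k, _ in items], [v for _, v in items]))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for keys, vals in reversed(self.runs):   # newest run wins
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return vals[i]
        return None
```

Real stores bound the "search all logs" cost with per-run indexes, Bloom filters, and background compaction that merges runs.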
11 Extending a Prior Benchmark Tool
Yahoo! Cloud Serving Benchmark (YCSB) tool: steady-state load of CRUD (create-read-update-delete) operations
Command-line parameters: DB to use, target throughput, number of threads
Workload parameter file: R/W mix, record size, data set
[Architecture: the YCSB client's workload executor drives client threads, with stats collection and a pluggable DB client talking to the cloud DB]
Extensible: define new workloads; plug in new clients
github.com/brianfrankcooper/ycsb [SoCC10]
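For concreteness, a workload parameter file of the kind listed above might look like this (illustrative values only; the property names follow YCSB's CoreWorkload):

```properties
# Hypothetical YCSB workload: 95/5 read/update mix over 1M records
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=1000000
operationcount=10000000
readproportion=0.95
updateproportion=0.05
requestdistribution=zipfian
fieldcount=10
fieldlength=100
```

It would be driven with something like `bin/ycsb run hbase -P myworkload -threads 8 -target 1000`, where `-threads` and `-target` correspond to the command-line parameters named on the slide.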
12 Adv. Features of YCSB++
High ingest-rate features: deep batch writing; pre-splitting tablets (given future insert distribution); bulk load (MapReduce formats map files externally)
Read features: read-after-write (what price eventual consistency?); offloading filtering to servers
Security ACLs: what performance price?
Better interpretation of monitoring: integrate knowledge of services, user jobs (Otus)
To be published in SoCC (October 2011)
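Pre-splitting tablets "given the future insert distribution" amounts to choosing split keys at quantiles of a sample of the keys to be inserted, so each tablet receives roughly even load. A hypothetical sketch; real stores take these keys at table-creation time:

```python
def split_points(sample_keys, n_tablets):
    """Pick n_tablets-1 split keys at quantiles of a sample drawn from
    the expected insert key distribution (a pre-splitting sketch)."""
    keys = sorted(sample_keys)
    step = len(keys) / float(n_tablets)
    return [keys[min(len(keys) - 1, int(round(i * step)))]
            for i in range(1, n_tablets)]
```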
13 YCSB++ Framework
Workload parameter file: R/W mix, RecordSize, DataSet, extensions
Command-line parameters (e.g., DB name, NumThreads)
YCSB client (with our extensions): workload executor (new workloads, multi-phase processing), client threads, stats, DB clients (YCSB metrics API extensions for HBase, IcyTable, Accumulo, other DBs)
Client nodes: YCSB client coordination via ZooKeeper-based barrier sync and event notification
Ganglia monitoring of Hadoop, HDFS and OS metrics on storage servers
github.com/milopolte/ycsb (pushing to main branch)
14 Accepted into Apache Incubator Sept
15 Extensions for Monitoring (Otus)
[Dashboard plots: virtual memory (bytes), running map tasks, read requests (ops) and CPU usage, broken down by HDFS DataNode, TaskTracker, other MapReduce jobs and other processes, including DataNode read requests from remote clients]
Service stats (Hadoop, HBase, HDFS, ...)
Walk the process group tree looking for specific command lines; aggregate stats for subgroups
Customizable displays
github.com/otus/otus
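The "walk the process group tree looking for specific command lines" step might look roughly like this: a process whose command line matches a service pattern starts a group, and its descendants inherit that group, so per-service totals include helper processes. A simplified sketch over an in-memory process table rather than /proc, and not Otus's actual code:

```python
def aggregate_by_service(procs, patterns):
    """Aggregate a stat per service group.
    procs: {pid: (ppid, cmdline, rss_bytes)}
    patterns: {service_name: substring to look for in a command line}"""
    def group_of(pid):
        ppid, cmd, _ = procs[pid]
        for name, pat in patterns.items():
            if pat in cmd:
                return name
        # No match: inherit the parent's group (assumes no pid cycles).
        return group_of(ppid) if ppid in procs else "other"

    totals = {}
    for pid, (_, _, rss) in procs.items():
        g = group_of(pid)
        totals[g] = totals.get(g, 0) + rss
    return totals
```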
16 Server-side Filtering (DoD BigTable vs. HBase)
Filtering when little data is desired leads to excessive prefetching on the server, because it fills the scanner batch
Size the scanner batch to the expected result size (scaled buffer)
The HBase table was decomposed into more columnar stores, so Accumulo does more work
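The scaled-buffer effect can be seen in a toy model: with a selective filter, the server must scan far ahead just to fill a large scanner batch before it can reply, while a batch sized to the expected result returns early. A sketch, not the HBase/Accumulo scanner code:

```python
def rows_scanned_before_first_reply(rows, predicate, batch_size):
    """How many rows the server reads before it can return its first
    batch of filter matches (the excessive-prefetch effect)."""
    hits = 0
    for scanned, row in enumerate(rows, 1):
        if predicate(row):
            hits += 1
            if hits == batch_size:
                return scanned
    return len(rows)   # filter never filled the batch: full scan
```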
17 Batch Writers & Eventual Consistency
Small batches burn excessive client CPU, limiting throughput
Large batches saturate servers, limiting the benefit of batching
18 Batch Writers & Eventual Consistency
Deferred write wins, but visible latency can be 100 secs
[Figure: fraction of requests vs. read-after-write time lag (ms) for buffer sizes of 10 KB, 100 KB, 1 MB and 10 MB; (a) HBase, (b) IcyTable]
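The deferred-write behavior behind this time lag can be sketched as a client-side buffer that makes mutations visible to readers only at flush, so read-after-write lag grows with the buffer size. A toy model, not the actual HBase/Accumulo batch writers:

```python
class BatchWriter:
    """Deferred-write sketch: puts accumulate client-side and reach the
    store (become readable) only when the buffer exceeds its byte limit
    or flush() is called explicitly."""

    def __init__(self, store, buffer_bytes):
        self.store, self.limit = store, buffer_bytes
        self.buf, self.size = [], 0

    def put(self, key, value):
        self.buf.append((key, value))
        self.size += len(value)
        if self.size >= self.limit:
            self.flush()

    def flush(self):
        for k, v in self.buf:
            self.store[k] = v
        self.buf, self.size = [], 0
```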
19 Pre- (and post-) Tablet Splitting
6 servers; per server: preload 1M rows, load 8M rows, measure @ 100 ops/s
20% faster load if pre-split; post-load rebalancing hurts for minutes
20 Improving Ingest Speed: Bulk Load
Faster ingest: format with MapReduce, ingest/import with bulk load, rebalance during the measurement phase
Test: preload, monitor/measure, format bulk, bulk load, monitor/measure, sleep 5 minutes, monitor/measure
Per server: preload 1M rows, load 8M rows, measure @ 100 ops/s
Import turns out to be nearly instant, but rebalancing is not
Load 48M rows one at a time: secs, mins
Bulk load, including formatting time: 5-12 mins (2-5X faster)
[Timeline: MapReduce format, import, rebalancing make up end-to-end ingest time; data becomes available at import, but queries may slow down during rebalancing]
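Why import is nearly instant: the pre-formatted, sorted files are merely assigned to the tablets whose key ranges cover them, with no per-row work; the per-row cost was paid earlier in the MapReduce format step, and the remaining cost shows up later as rebalancing. A hypothetical sketch (it assumes each file fits inside one tablet's range):

```python
def bulk_import(tablets, sorted_files):
    """Assign pre-sorted files to tablets by key range.
    tablets: [{"range": (lo, hi), "files": [...]}] with lo <= key < hi
    sorted_files: [(first_key, last_key, path)]"""
    for first_key, last_key, path in sorted_files:
        for t in tablets:
            lo, hi = t["range"]
            if lo <= first_key and last_key < hi:
                t["files"].append(path)   # registration only: no row I/O
                break
    return tablets
```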
21 Scaling & Bulk Loading
1/8M rows per server on Accumulo
Scaling MapReduce means more files & more compaction
[Chart: minutes spent in PreLoad, PL-Rebalance, BulkLoad and BL-Rebalance at 6 and 54 servers (36 map files); data becomes available at import, but queries may slow down during rebalancing]
22 Rebalancing Timeline (54 Servers / 36 MapFiles)
[Timeline chart of the numbered rebalancing phases]
Phase 1 rebalancing starts late; too much rebalancing work
23 So How Do We Test At Scale?
At cloud scale, very few users can afford extended experiment time on public clouds
Many systems experiments want to be repeatable, isolated, instrumented, fault-injected, on specialized kernels
Almost no one running a public cloud could (would) (SHOULD) support such invasive apps
24 LANL was going to trash this!
25 NSF PRObE to the Rescue
NSF funds the New Mexico Consortium to recycle LANL supercomputers
PRObE: Parallel Reconfigurable Observational Environment
Low-level systems research facility: days to weeks of dedicated usage, complete control of hardware and software, fault injection and failure statistics
26 PRObE Hardware Plan
Spring 2012: Sitka (2048 cores) acquired; 1024 nodes, dual-socket single-core AMD Opteron, 4GB RAM per core, full fat-tree Myrinet
Summer 2012: Kodiak (2048 cores) acquired; 1024 nodes, dual-socket single-core AMD Opteron, 4GB RAM per core, fat-tree SDR InfiniBand; a 128-node version at CMU, Marmot, standing up now
Fall 2011: Susitna (1700 cores) being acquired; 26 nodes, 16-core CPUs, 1GB RAM/core, QDR InfiniBand, GPU; planning to build at CMU soon
Fall 2013: Nome (1600 cores) anticipated; 200 nodes, quad-socket dual-core AMD Opteron, 2GB RAM per core, fat-tree DDR InfiniBand
Fall 2013: Matanuska (3456 cores) anticipated; 36 nodes, 24-core CPUs, 1-2GB RAM/core, 100Gbit Ethernet
27 PRObE Software
First, none: researchers can put any software they want onto the clusters
Second, a well-known tool for managing clusters of hardware for research: Emulab (Flux Group, U. Utah)
On staging clusters, also on large clusters
Enhanced for PRObE hardware, scale, networks, resource partitioning policies, remote power and console, failure injection, deep instrumentation
PRObE provides hardware support (spares)
Garth Gibson, Oct 2010!
28 For Systems Research
NSF rules govern which users can apply
Includes international and corporate research projects (best in partnership with a US university)
newmexicoconsortium.org/probe
29 On Education Front: BigData Masters
Extends MSIT Very Large Information Systems (VLIS)
Tracks for BigData systems and applications
One year on campus, incl. two project courses, plus a 7-month internship at the end
Already using Hadoop on the OpenCloud cluster in some courses
Systems courses: Distributed Computing, Storage Systems, Cloud Computing, Data Mining, Parallel Comp. Arch & Programming
Applications courses: VLIS, Software Eng., Machine Learning, Information Retrieval
Seeking students, internship & permanent employers
It's all about expanding training of BigData professionals
30 Research Sponsors Companies of Parallel Data Consortium: APC, EMC, Facebook, Google, Hewlett-Packard, Hitachi, Intel, Microsoft, NEC, NetApp, Oracle, Panasas, Riverbed, Samsung, Seagate, STEC, Symantec, VMware
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW
More informationThe BioHPC Nucleus Cluster & Future Developments
1 The BioHPC Nucleus Cluster & Future Developments Overview Today we ll talk about the BioHPC Nucleus HPC cluster with some technical details for those interested! How is it designed? What hardware does
More informationCS November 2018
Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationOracle Exadata X7. Uwe Kirchhoff Oracle ACS - Delivery Senior Principal Service Delivery Engineer
Oracle Exadata X7 Uwe Kirchhoff Oracle ACS - Delivery Senior Principal Service Delivery Engineer 05.12.2017 Oracle Engineered Systems ZFS Backup Appliance Zero Data Loss Recovery Appliance Exadata Database
More informationThe Google File System. Alexandru Costan
1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems
More informationName: Instructions. Problem 1 : Short answer. [48 points] CMU / Storage Systems 23 Feb 2011 Spring 2012 Exam 1
CMU 18-746/15-746 Storage Systems 23 Feb 2011 Spring 2012 Exam 1 Instructions Name: There are three (3) questions on the exam. You may find questions that could have several answers and require an explanation
More informationHadoop An Overview. - Socrates CCDH
Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected
More informationLazyBase: Trading freshness and performance in a scalable database
LazyBase: Trading freshness and performance in a scalable database (EuroSys 2012) Jim Cipar, Greg Ganger, *Kimberly Keeton, *Craig A. N. Soules, *Brad Morrey, *Alistair Veitch PARALLEL DATA LABORATORY
More informationParallel File Systems for HPC
Introduction to Scuola Internazionale Superiore di Studi Avanzati Trieste November 2008 Advanced School in High Performance and Grid Computing Outline 1 The Need for 2 The File System 3 Cluster & A typical
More informationGoogle File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information
Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute
More informationVOLTDB + HP VERTICA. page
VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics
More informationScalable I/O, File Systems, and Storage Networks R&D at Los Alamos LA-UR /2005. Gary Grider CCN-9
Scalable I/O, File Systems, and Storage Networks R&D at Los Alamos LA-UR-05-2030 05/2005 Gary Grider CCN-9 Background Disk2500 TeraBytes Parallel I/O What drives us? Provide reliable, easy-to-use, high-performance,
More informationCS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab
CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material
More informationNext-Generation NVMe-Native Parallel Filesystem for Accelerating HPC Workloads
Next-Generation NVMe-Native Parallel Filesystem for Accelerating HPC Workloads Liran Zvibel CEO, Co-founder WekaIO @liranzvibel 1 WekaIO Matrix: Full-featured and Flexible Public or Private S3 Compatible
More informationStorage Optimization with Oracle Database 11g
Storage Optimization with Oracle Database 11g Terabytes of Data Reduce Storage Costs by Factor of 10x Data Growth Continues to Outpace Budget Growth Rate of Database Growth 1000 800 600 400 200 1998 2000
More informationEsgynDB Enterprise 2.0 Platform Reference Architecture
EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed
More informationCIT 668: System Architecture. Amazon Web Services
CIT 668: System Architecture Amazon Web Services Topics 1. AWS Global Infrastructure 2. Foundation Services 1. Compute 2. Storage 3. Database 4. Network 3. AWS Economics Amazon Services Architecture Regions
More informationReflections on Failure in Post-Terascale Parallel Computing
Reflections on Failure in Post-Terascale Parallel Computing 2007 Int. Conf. on Parallel Processing, Xi An China Garth Gibson Carnegie Mellon University and Panasas Inc. DOE SciDAC Petascale Data Storage
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later
More informationIn the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K.
In the multi-core age, How do larger, faster and cheaper and more responsive sub-systems affect data management? Panel at ADMS 211 Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory Department
More informationAdvanced Database Systems
Lecture II Storage Layer Kyumars Sheykh Esmaili Course s Syllabus Core Topics Storage Layer Query Processing and Optimization Transaction Management and Recovery Advanced Topics Cloud Computing and Web
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More information