Voldemort. Smruti R. Sarangi, Department of Computer Science, Indian Institute of Technology, New Delhi, India. Overview, Design, Evaluation.

Voldemort
Smruti R. Sarangi, Department of Computer Science, Indian Institute of Technology, New Delhi, India

Outline: 1. Overview 2. Design 3. Evaluation

Data-Intensive Web Sites
Many data-intensive sites contain the following features:
- People you may know...
- Items you may like (recommendations)
- Relationships between pairs of people
LinkedIn (135 million users) features many more such relationships.
Phases: three phases, namely data collection, processing, and serving.

LinkedIn
- Voldemort is a key-value system for finally serving the data.
- LinkedIn uses Voldemort (which is now open source); several other sites such as eHarmony and Nokia use it as well.
- A Hadoop engine processes all the data in LinkedIn's data store and creates a read-only database that Voldemort serves.
- Support for quick updates and load balancing.
- Optimized for serving bulk read-only data quickly: twice as fast as MySQL (4 TB of new data every day).

Related Work
Two MySQL solutions: MyISAM and InnoDB
- MyISAM: compact on-disk data structure; creates the index after loading the data file; locks the complete table during loading.
- InnoDB: supports fine-grained row-level locking, but is very slow and requires a lot of disk space.
PNUTS (Yahoo):
- CPUs are shared between data-loading modules and data-serving modules, which reduces the performance of the data-serving modules.

Outline: 1. Overview 2. Design 3. Evaluation

Clusters, Nodes, and Stores
- A Voldemort cluster contains multiple nodes; a physical host can run multiple nodes.
- Each node hosts a given number of stores (a store is analogous to a database table). Example stores: member id → recommended group ids; group id → description.
- Attributes of a store:
  - Replication factor: the number of nodes that contain this store
  - Read/write quorum sizes
  - Serialization (XML/JSON) and compression
  - Storage engine: Berkeley DB or MySQL
(A small sketch of these attributes follows.)
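
To make the list above concrete, here is a minimal Java sketch of a per-store configuration object; the class name, field names, and example values are illustrative assumptions, not the actual Voldemort API or configuration format.

```java
/** Minimal sketch of the per-store attributes listed above (names are illustrative). */
public final class StoreDefinitionSketch {
    final String name;             // e.g. "member-id-to-recommended-group-ids" (hypothetical store name)
    final int replicationFactor;   // number of nodes that contain this store
    final int requiredReads;       // read quorum size
    final int requiredWrites;      // write quorum size
    final String serialization;    // "json" or "xml"
    final String compression;      // e.g. "gzip", or null for none
    final String storageEngine;    // "bdb" (Berkeley DB) or "mysql"

    StoreDefinitionSketch(String name, int replicationFactor, int requiredReads, int requiredWrites,
                          String serialization, String compression, String storageEngine) {
        this.name = name;
        this.replicationFactor = replicationFactor;
        this.requiredReads = requiredReads;
        this.requiredWrites = requiredWrites;
        this.serialization = serialization;
        this.compression = compression;
        this.storageEngine = storageEngine;
    }
}
```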

Requests and Responses [architecture figure: requests and responses flow through the client API, conflict resolution, and serialization layers, over the network client/server layer, down to the storage engine]

Outline: 1. Overview 2. Design 3. Evaluation

Routing Module
- Deals with partitioning and replication.
- Splits the hash ring into equal-sized partitions and maps partitions to nodes.
- Each key is mapped to a preference list (of partitions): a primary partition (based on the key's hash) plus the N-1 subsequent partitions (clockwise). (A sketch follows below.)
Pluggable Storage Layer
- Traditional get and put functions
- Block-read functions (streaming)
- Read-only operations
- Administrative functions
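
A minimal sketch of the preference-list computation described above, assuming a ring split into equal partitions and a replication factor of N; the class and method names are my own, not the Voldemort routing API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative preference-list computation on a partitioned hash ring. */
public class PreferenceListSketch {

    /** Map a key's hash to its primary partition on the ring. */
    static int primaryPartition(int keyHash, int numPartitions) {
        // floorMod so negative hash codes still land on a valid partition.
        return Math.floorMod(keyHash, numPartitions);
    }

    /** Preference list: the primary partition plus the N-1 partitions that follow it clockwise. */
    static List<Integer> preferenceList(int keyHash, int numPartitions, int replicationFactorN) {
        List<Integer> partitions = new ArrayList<>();
        int primary = primaryPartition(keyHash, numPartitions);
        for (int i = 0; i < replicationFactorN; i++) {
            // A production router may also need to skip partitions that map to a node
            // already in the list; that refinement is omitted here.
            partitions.add((primary + i) % numPartitions);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // e.g. 16 partitions on the ring, replication factor 3
        System.out.println(preferenceList("member:42".hashCode(), 16, 3));
    }
}
```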

Routing Modes
Definition (routing modes):
- Client-side routing: the client retrieves the metadata and cluster topology and makes the routing decisions locally.
- Server-side routing: the server takes all the routing decisions.

Outline: 1. Overview 2. Design 3. Evaluation

Shortcomings of MySQL and BDB
- For bulk loading, issuing multiple put requests is not the correct solution: the underlying B+ tree has to be updated many times.
- Alternative solution 1: maintain a separate server that holds a copy of the database and switch to the new copy instantaneously. Disadvantage: double the resources and bulk-copy overhead.
- Alternative solution 2: run Hadoop to generate the indices offline and atomically switch to the new set of indices. Still suffers from the problem of additional resources.

Requirements
- Minimize the performance overhead of live requests
- Scaling and fault tolerance
- Fast rollback capability
- Ability to handle large datasets

List of Steps [figure omitted in the transcription]

Storage Format
- Voldemort is Java based (runs on a JVM) and relies on the OS to manage memory for its data files.
- The input data destined for a node is split into multiple chunk buckets, and each chunk bucket is split into multiple chunk sets.
- A chunk bucket is defined by its partition id and replica id.
- Each chunk set has a data file and an index file.
- Naming convention of a chunk set's files: partition id_replica id_chunk set id.{data,index}
- Entry in the index file: the top 8 bytes of the key's MD5 signature + a 4-byte offset into the data file. (A sketch of this layout follows.)
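
A minimal sketch of how one index entry could be laid out, assuming the 12-byte format described above (top 8 bytes of the key's MD5 followed by a 4-byte offset); the class and helper names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative 12-byte index entry: top 8 bytes of MD5(key) + 4-byte offset into the data file. */
public class IndexEntrySketch {

    static byte[] indexEntry(byte[] key, int dataFileOffset) throws NoSuchAlgorithmException {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(key);   // 16 bytes
        return ByteBuffer.allocate(12)
                .put(md5, 0, 8)          // top 8 bytes of the MD5 signature
                .putInt(dataFileOffset)  // 4-byte offset of the entry in the .data file
                .array();
    }

    public static void main(String[] args) throws Exception {
        byte[] entry = indexEntry("member:42".getBytes(StandardCharsets.UTF_8), 1024);
        System.out.println("index entry length = " + entry.length);   // prints 12
    }
}
```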

Structure of the Data File
- Each entry stores the number of collided tuples (keys that share the same 8-byte MD5 prefix),
- followed by the list of collided tuples: key size, value size, key, value. (A serialization sketch follows.)
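
A sketch of how a data-file entry with the layout above could be serialized. The field widths chosen here (a 2-byte tuple count, 4-byte key/value sizes) are assumptions for illustration, not the exact on-disk encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

/** Illustrative serialization of one data-file entry:
    tuple count, then (key size, value size, key, value) for each collided tuple. */
public class DataFileEntrySketch {

    static byte[] serializeEntry(List<byte[]> keys, List<byte[]> values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeShort(keys.size());              // number of tuples colliding on the 8-byte MD5 prefix
        for (int i = 0; i < keys.size(); i++) {
            out.writeInt(keys.get(i).length);     // key size
            out.writeInt(values.get(i).length);   // value size
            out.write(keys.get(i));               // key bytes
            out.write(values.get(i));             // value bytes
        }
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] entry = serializeEntry(
                List.of("k1".getBytes(), "k2".getBytes()),
                List.of("v1".getBytes(), "v2".getBytes()));
        System.out.println("entry length = " + entry.length);
    }
}
```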

Chunk Set Generation
- Inputs: chunk sets per bucket, cluster topology, store definitions, and input data locations in HDFS.
- Mapper: emits the upper 8 bytes of the MD5 of the key along with the node id, partition id, replica id, key, and value.
- Partitioner: routes data to the correct reducer based on the chunk set id. (A hedged sketch follows below.)
- Reducer: every reducer is responsible for only one chunk set; Hadoop sorts its inputs based on the keys.
- The output for each Voldemort node is a directory in HDFS, with files for each chunk set.
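
For illustration only, here is a rough Hadoop partitioner in the spirit of the step above. It assumes, as stated assumptions rather than the actual Voldemort build job, that the map output key is the 8-byte MD5 prefix and that the mapper packs the node id, partition id, and replica id as three ints at the front of the value; CHUNK_SETS_PER_BUCKET is a made-up configuration constant.

```java
import java.nio.ByteBuffer;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.Partitioner;

/** Illustrative partitioner: routes each record to the reducer that owns its chunk set. */
public class ChunkSetPartitionerSketch extends Partitioner<BytesWritable, BytesWritable> {

    private static final int CHUNK_SETS_PER_BUCKET = 16;   // assumed configuration value

    @Override
    public int getPartition(BytesWritable key, BytesWritable value, int numReduceTasks) {
        // Assumed value layout: [nodeId:int][partitionId:int][replicaId:int][original key/value...]
        ByteBuffer meta = ByteBuffer.wrap(value.getBytes(), 0, 12);
        int nodeId = meta.getInt();
        int partitionId = meta.getInt();
        int replicaId = meta.getInt();

        // Derive the chunk set from the MD5-based key so all records of one chunk set meet at one reducer.
        int chunkSetId = Math.floorMod(ByteBuffer.wrap(key.getBytes(), 0, 8).getInt(),
                                       CHUNK_SETS_PER_BUCKET);

        // Combine the bucket identity (node, partition, replica) with the chunk set id,
        // then fold the result into the reducer space.
        int bucket = (nodeId * 31 + partitionId) * 31 + replicaId;
        return Math.floorMod(bucket * CHUNK_SETS_PER_BUCKET + chunkSetId, numReduceTasks);
    }
}
```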

Data Versioning
- Every store is represented by a directory, and every version of a store has a unique directory.
- The current version of a store points to the right directory using a symbolic link.
- Moving to a new version (see the sketch below):
  1. Get a read-write lock on the previous version's directory
  2. Close all the files
  3. Open the files in the new directory and map them to memory
  4. Switch the symbolic link
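
A small sketch of the symbolic-link switch using standard java.nio.file calls; the directory names are hypothetical, and the locking, file-close, and memory-mapping steps from the list above are only indicated in comments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative version swap: repoint a "current" symbolic link at a new version directory. */
public class VersionSwapSketch {

    static void swapToNewVersion(Path storeDir, String newVersionDirName) throws IOException {
        Path current = storeDir.resolve("current");   // e.g. stores/group-recs/current (hypothetical layout)

        // A real implementation would first take the store's read-write lock, close the old
        // files, and memory-map the files in the new version directory (see the steps above).
        Files.deleteIfExists(current);
        // Relative link target, so the link stays valid if the store directory moves.
        Files.createSymbolicLink(current, Paths.get(newVersionDirName));
        // Note: delete-then-create is not atomic; an atomic variant would create a temporary
        // link and rename it over "current".
    }

    public static void main(String[] args) throws IOException {
        // Assumes stores/group-recs/version-42 already exists on disk.
        swapToNewVersion(Paths.get("stores/group-recs"), "version-42");
    }
}
```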

Data Loading
- The driver triggers a Hadoop job to create a new version, and then triggers a fetch request on all the Voldemort nodes.
- Uses a pull-based model (traffic-based fetch throttling).
- Swaps the version's directory with global atomic semantics; the swap takes 0.05 ms at LinkedIn.

Outline: 1. Overview 2. Design 3. Evaluation

Retrieval
- Calculate the MD5 of the key.
- Generate the primary partition id, any replica id, and the chunk set id (first 4 bits of the MD5).
- Find the chunk set's index file, then locate the value in the chunk set's data file.
- Searching the index file is the most time-consuming step; ensure the index files are in main memory by fetching them last (so they remain in the OS page cache).
- Use interpolation search: O(log(log(N))) expected probes. (A sketch follows below.)
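
A self-contained sketch of interpolation search over a sorted array of 8-byte key prefixes. For simplicity it treats the prefixes as non-negative signed longs; a real index would compare the MD5 prefixes as unsigned values read from the memory-mapped file.

```java
/** Illustrative interpolation search over a sorted array of key prefixes. */
public class InterpolationSearchSketch {

    /** Returns the index of target in sortedKeys, or -1 if absent.
        Expected O(log log N) probes when keys are uniformly distributed (true for MD5 prefixes). */
    static int interpolationSearch(long[] sortedKeys, long target) {
        int lo = 0, hi = sortedKeys.length - 1;
        while (lo <= hi && target >= sortedKeys[lo] && target <= sortedKeys[hi]) {
            if (sortedKeys[hi] == sortedKeys[lo]) {
                return sortedKeys[lo] == target ? lo : -1;
            }
            // Estimate the position by assuming keys are uniformly spread between lo and hi.
            double fraction = (double) (target - sortedKeys[lo])
                            / (double) (sortedKeys[hi] - sortedKeys[lo]);
            int probe = lo + (int) (fraction * (hi - lo));
            if (sortedKeys[probe] == target) return probe;
            if (sortedKeys[probe] < target) lo = probe + 1;
            else hi = probe - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] keys = {3L, 17L, 42L, 99L, 256L, 1024L};
        System.out.println(interpolationSearch(keys, 99L));   // prints 3
    }
}
```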

Schema Upgrades and Rebalancing
Schema upgrades:
- Use version bits with each JSON file.
- The mapping of the version bits to the schema is saved in the metadata section of the store definition.
Rebalancing:
- We can dynamically add partitions.
- Create a plan for moving partitions and their replicas.
- Start moving the partitions, and lazily propagate information about temporary topologies.

Setup
- Simulated data set: keys are random 1024-byte strings.
- Linux-based setup: dual-CPU machines with 8 cores and 24 GB RAM.

Build Time [results figure]

Read Latency [results figure]

Queries Per Second [results figure]

People You May Know [results figure]

Collaborative Filtering [results figure]

Reference: Sumbaly, Roshan, et al. "Serving Large-Scale Batch Computed Data with Project Voldemort." Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST '12). USENIX Association, 2012.