Introduction to Distributed Data Systems


Introduction to Distributed Data Systems
Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart
Web Data Management and Distribution, http://webdam.inria.fr/textbook
January 13, 2014

Outline
1. Introduction
2. Overview of distributed data management principles
3. Properties of a distributed system
4. Failure management
5. Consistent hashing
6. Case study: large scale distributed file systems (GFS)

Guidelines and principles for distributed data management
- Basics of distributed systems (networks, performance, principles)
- Failure management and distributed transactions
- Properties of a distributed system
- Specifics of peer-to-peer networks
- A powerful and resilient distribution method: consistent hashing
- Case study: GFS, a distributed file system

Overview of distributed data management principles

Distributed systems
A distributed system is an application that coordinates the actions of several computers to achieve a specific task. This coordination is achieved by exchanging messages, pieces of data that convey some information.
Shared-nothing architecture: no shared memory, no shared disk.
The system relies on a network that connects the computers and handles the routing of messages: local area networks (LAN), peer-to-peer (P2P) networks, etc.
Clients (nodes) and servers (nodes) are communicating software components; we identify them with the machines they run on.

LAN-based infrastructure: clusters of machines
Three communication levels: racks, clusters, and groups of clusters.
(Figure: client and server nodes exchanging messages through a switch and a network segment.)

Example: data centers
Typical setting of a Google data center:
1. 40 servers per rack;
2. 150 racks per data center (cluster);
3. 6,000 servers per data center;
4. how many clusters? Google's secret, and constantly evolving. Rough estimate: 150-200 data centers? 1,000,000 servers?

P2P infrastructure: Internet-based communication
Nodes, or peers, communicate with messages sent over the Internet. The physical route may consist of 10 or more forwarding steps, or hops.
Suggestion: use the traceroute utility to check the route between your laptop and a Web site of your choice.

Performance
Type       Latency                                Bandwidth
Disk       ~5 x 10^-3 s (5 ms)                    at best 100 MB/s
LAN        1-2 x 10^-3 s (1-2 ms)                 1 Gb/s (single rack); 100 Mb/s (switched)
Internet   highly variable, typ. 10-100 ms        highly variable, typ. a few MB/s
Bottom line (1): it is approximately one order of magnitude faster to exchange main-memory data between two machines in the same rack of a data center than to read from the disk.
Bottom line (2): exchanging through the Internet is slow and unreliable with respect to LANs.

Distribution, why?
Sequential access: it takes about 166 minutes (more than two and a half hours) to read a 1 TB disk at 100 MB/s.
Parallel access: with 100 disks working in parallel, each reading its share sequentially: about 1 min 40 s.
Distributed access: with 100 computers, each disposing of its own local disk: each CPU processes its own dataset.
Scalability: the latter solution is scalable, by adding new computing resources.
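
A quick back-of-the-envelope check of these figures, as a minimal sketch (the 100 MB/s rate comes from the performance table above; the helper function is our own, not from the slides):

```python
# Back-of-envelope check of the scan times above, assuming perfect parallelism.
def read_time_seconds(volume_bytes: float, disks: int, mb_per_second: float = 100.0) -> float:
    """Time to scan `volume_bytes` split evenly over `disks`, each at `mb_per_second`."""
    per_disk = volume_bytes / disks
    return per_disk / (mb_per_second * 1_000_000)

TB = 10**12
print(read_time_seconds(TB, disks=1) / 60)   # ~166 minutes for a single disk
print(read_time_seconds(TB, disks=100))      # ~100 seconds with 100 disks in parallel
```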

What you should remember: performance of data-centric distributed systems
1. Disk transfer rate is a bottleneck for large scale data management; parallelization and distribution of the data over many machines is a means to eliminate this bottleneck.
2. Write once, read many: a distributed storage system is appropriate for large files that are written once and then repeatedly scanned.
3. Data locality: bandwidth is a scarce resource, and programs should be pushed near the data they must access.
A distributed system also gives an opportunity to reinforce the safety of data with replication.

Data replication and consistency: what is it about?
Replication: a mechanism that copies a data item located on a machine A to a remote machine B; one obtains replicas.
Consistency: the ability of a system to behave as if the transactions of each user always run in isolation from other transactions, and never fail.
Example: the shopping basket in an e-commerce application.
Difficult in centralized systems because of multiple users and concurrency; even more difficult in distributed systems because of replicas.

Some illustrative scenarios
(Figure: a client issues put(d) on server N1, which holds the primary copy; the update is propagated to the replica on server N2, where another client issues read(d).)
a) Eager (synchronous) replication with primary copy
b) Lazy (asynchronous) replication with primary copy
c) Eager replication, distributed
d) Lazy replication, distributed

Consistency management in distributed systems
Consistency: essentially, ensures that the system faithfully reflects the actions of a user.
Strong consistency (ACID properties) requires (slow) synchronous replication, and possibly heavy locking mechanisms.
Weak consistency accepts serving some requests with outdated data.
Eventual consistency: same as before, but the system is guaranteed to converge towards a consistent state based on the last version.
In a system that is not eventually consistent, conflicts occur and the application must take care of data reconciliation: given the two conflicting copies, determine the new current one.
Standard RDBMSs favor consistency over availability, one of the reasons (?) for the NoSQL trend.
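
To illustrate application-level reconciliation, here is a minimal sketch (our own, not from the slides) that merges two conflicting shopping-basket replicas by keeping every item seen in either copy, a policy similar in spirit to the one used by e-commerce carts:

```python
# Hypothetical reconciliation policy: merge conflicting basket replicas by union.
# Each replica is a dict item_id -> quantity; we keep the maximum quantity per item,
# so no addition is lost (deleted items, however, may resurface: a known drawback).
def reconcile(replica_a: dict, replica_b: dict) -> dict:
    merged = dict(replica_a)
    for item, qty in replica_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

print(reconcile({"book": 1, "pen": 2}, {"book": 1, "mug": 1}))
# {'book': 1, 'pen': 2, 'mug': 1}
```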

Achieving strong consistency: the 2PC protocol
2PC = Two-Phase Commit, the algorithm of choice to ensure ACID properties in a distributed setting.
Problem: the update operations of a transaction may occur on distinct servers S1, ..., Sn, called participants.

Two-Phase Commit: overview (figure)
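
Since the figure is not reproduced here, the following minimal sketch (our own illustration, with hypothetical class and method names) shows the two phases: the coordinator first asks every participant to prepare, and commits only if all of them vote yes; otherwise it aborts everywhere.

```python
# Minimal two-phase commit sketch (illustration only; no timeouts, no coordinator
# failure handling, no persistent decision log, all of which real 2PC needs).
class Participant:
    def prepare(self, txn_id: str) -> bool:
        """Phase 1: make the local updates durable (e.g., in a log) and vote."""
        return True  # vote "yes" if the local work can be committed

    def commit(self, txn_id: str) -> None:
        print(f"participant {id(self)}: commit {txn_id}")

    def abort(self, txn_id: str) -> None:
        print(f"participant {id(self)}: abort {txn_id}")

def two_phase_commit(txn_id: str, participants: list[Participant]) -> bool:
    # Phase 1: voting
    votes = [p.prepare(txn_id) for p in participants]
    # Phase 2: decision, broadcast to every participant
    if all(votes):
        for p in participants:
            p.commit(txn_id)
        return True
    for p in participants:
        p.abort(txn_id)
    return False

two_phase_commit("t42", [Participant(), Participant(), Participant()])
```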

Exercises and questions
You are a customer using an e-commerce application which is known to be eventually consistent (e.g., Amazon):
1. Show a scenario where you buy an item, but this item does not appear in your basket.
2. You reload the page: the item appears. What happened?
3. You delete an item from your order and add another one: the basket shows both items. What happened?
4. Will the situation change if you reload the page?
5. Would you expect both items to disappear in an e-commerce application?

Properties of a distributed system

Properties of a distributed system: (i) scalability
Scalability refers to the ability of a system to continuously evolve in order to support an ever-growing amount of tasks.
A scalable system should (i) distribute the task load evenly among all participants, and (ii) ensure a negligible distribution management cost.

Properties of a distributed system: (ii) efficiency
Two usual measures of the efficiency of a distributed system are the response time (or latency), which denotes the delay to obtain the first item, and the throughput (or bandwidth), which denotes the number of items delivered in a given period unit (e.g., a second).
Unit costs:
1. number of messages globally sent by the nodes of the system, regardless of the message size;
2. size of messages, representing the volume of data exchanged.

Properties of a distributed system: (iii) availability
Availability is the capacity of a system to limit as much as possible its latency. It involves several aspects:
Failure detection: monitor the participating nodes to detect failures as early as possible (usually via heartbeat messages); design quick restart protocols.
Replication on several nodes: replication may be synchronous or asynchronous; in the latter case, a client write() returns before the operation has been completed on every site.

Failure management

Failure recovery in centralized DBMSs
Common principles:
1. The state of a (data) system is the set of items committed by the application.
2. Updating in place is considered inefficient because of disk seeks.
3. Instead, updates are written in main memory and in a sequential log file.
4. Failure? The main memory is lost, but all the committed transactions are in the log: a REDO operation is carried out when the system restarts.
Implemented in all DBMSs.
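
A minimal write-ahead-logging sketch of this idea (our own illustration; a real recovery manager also handles checkpoints, UNDO of uncommitted work, etc.): updates are appended to a sequential log before being applied in memory, and after a crash the logged records are replayed to rebuild the in-memory state.

```python
# Minimal write-ahead log / REDO sketch (illustration only).
import json

LOG_PATH = "wal.log"  # hypothetical log file, written sequentially

def write(key: str, value: str, memory: dict) -> None:
    # 1. append the update to the sequential log (the durable part)
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps({"op": "put", "key": key, "value": value}) + "\n")
    # 2. apply it in main memory (the fast, volatile part)
    memory[key] = value

def redo() -> dict:
    """After a crash: rebuild main memory by replaying the committed log records."""
    memory = {}
    try:
        with open(LOG_PATH) as log:
            for line in log:
                record = json.loads(line)
                if record["op"] == "put":
                    memory[record["key"]] = record["value"]
    except FileNotFoundError:
        pass  # empty log: nothing to redo
    return memory

mem = {}
write("x", "1", mem)
print(redo())  # {'x': '1'} after a (simulated) restart
```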

Failure and distribution
First: when do we know that a component is in a failed state? Periodically send messages to each participant.
Second: does the centralized recovery still hold? Yes, provided the log file is accessible...
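
A minimal heartbeat-based failure detector, as a sketch (names and the timeout value are our own assumptions): a node is suspected as failed when nothing has been heard from it for longer than a chosen timeout.

```python
# Minimal heartbeat failure-detector sketch (illustration only; real detectors must
# deal with clock drift, message loss and false suspicions).
import time

TIMEOUT = 5.0                        # seconds without news before suspecting a node (assumed)
last_heard: dict[str, float] = {}    # node id -> time of the last heartbeat received

def on_heartbeat(node_id: str) -> None:
    last_heard[node_id] = time.monotonic()

def suspected(node_id: str) -> bool:
    heard = last_heard.get(node_id)
    return heard is None or time.monotonic() - heard > TIMEOUT
```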

Distributed logging
Note: distributed logging can be asynchronous (efficient, but risky) or synchronous (just the opposite).

Consistent hashing

Basics: centralized hash files
The collection consists of (key, value) pairs. A hash function evenly distributes the values in buckets with respect to the key.
This is the basic, static scheme: the number of buckets is fixed. Dynamic hashing extends the number of buckets as the collection grows; the most popular method is linear hashing.

Issues with the distribution of hash structures
Straightforward idea: everybody uses the same hash function, and buckets are replaced by servers. Two issues:
Dynamicity: at Web scale, we must be able to add or remove servers at any moment.
Inconsistencies: it is very hard to ensure that all participants share an accurate view of the system (e.g., of the hash function).
Some solutions:
Distributed linear hashing: a sophisticated scheme that allows client nodes to use an outdated image of the hash file; guarantees eventual convergence.
Consistent hashing: to be presented next.
NB: consistent hashing is used in several systems, including Dynamo (Amazon) / Voldemort (open source), and P2P structures, e.g., Chord.

Consistent hashing
Let N be the number of servers. The naive scheme maps a pair (key, value) to server S_i with i = hash(key) modulo N.
Fact: if N changes, or if a client uses an invalid value for N, the mapping becomes inconsistent.
With consistent hashing, the addition or removal of an instance does not significantly change the mapping of keys to servers:
- a simple, non-mutable hash function h maps both the keys and the servers' IPs to a large address space A (e.g., [0, 2^64 - 1]);
- A is organized as a ring, scanned in clockwise order;
- if S and S' are two adjacent servers on the ring, all the keys in the range ]h(S), h(S')] are mapped to S'.
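
A minimal consistent-hashing ring following the definition above, as a sketch (the hash function, class and method names are our own choices, not from the slides):

```python
# Minimal consistent-hashing sketch (no virtual nodes yet, no replication).
import bisect
import hashlib

def h(value: str) -> int:
    """Non-mutable hash into a large address space (here [0, 2^64 - 1])."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**64)

class Ring:
    def __init__(self, server_ips: list[str]):
        self.points = sorted((h(ip), ip) for ip in server_ips)
        self.positions = [pos for pos, _ in self.points]

    def server_for(self, key: str) -> str:
        # a key is handled by the first server clockwise from its position on the ring
        idx = bisect.bisect_right(self.positions, h(key)) % len(self.points)
        return self.points[idx][1]

ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.server_for("object-a"))  # adding/removing a server only moves nearby keys
```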

Illustration
Example: item a is mapped to server IP1-2; item b to server...
(Figure: a ring with servers IP1-1, IP1-2, IP2-2, IP3-1 and objects a and b placed on it.)
A server is added or removed? A local re-hashing is sufficient.

Some (really useful) refinements
What if a server fails? How can we balance the load?
Failure: use replication; put a copy on the next machine (on the ring), then on the next after the next, and so on.
Load balancing: map a server to several points on the ring (virtual nodes);
- the more points, the more load received by a server;
- also useful if the server fails: data relocation is more evenly distributed;
- also useful in case of heterogeneity (the rule in large-scale systems).
(Figure: the virtual nodes of server IP2-2 spread over several positions on the ring.)
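
Extending the earlier ring sketch with virtual nodes (again our own illustration; the number of virtual nodes per server is an arbitrary assumption):

```python
# Consistent hashing with virtual nodes: each server is hashed to several points on
# the ring, so keys (and, on failure, re-allocated data) spread more evenly.
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**64)

class VirtualRing:
    def __init__(self, server_ips: list[str], vnodes: int = 8):
        # one ring point per (server, virtual node index) pair
        self.points = sorted((h(f"{ip}#{i}"), ip) for ip in server_ips for i in range(vnodes))
        self.positions = [pos for pos, _ in self.points]

    def server_for(self, key: str) -> str:
        idx = bisect.bisect_right(self.positions, h(key)) % len(self.points)
        return self.points[idx][1]

ring = VirtualRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.server_for("object-a"))
```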

Distributed indexing based on consistent hashing
Main question: where is the hash directory (the servers' locations)? Several possible answers:
On a specific "master" node, acting as a load balancer. Example: caching systems. Raises scalability issues.
Each node records its successor on the ring. May require O(N) messages for routing queries; not resilient to failures.
Each node records log N carefully chosen other nodes. Ensures O(log N) messages for routing queries; a convenient trade-off for highly dynamic networks (e.g., P2P).
Full duplication of the hash directory at each node. Ensures 1 message for routing; heavy maintenance protocol, which can be achieved through gossiping (broadcast of any event affecting the network topology).

Case study: Dynamo (Amazon)
A distributed system that targets high availability (your shopping cart is stored there!).
Duplicates and maintains the hash directory at each node via gossiping: queries can be routed to the correct server with 1 message.
The hosting server S replicates N (an application parameter) copies of its objects on the N distinct nodes that follow S on the ring.
Propagates updates asynchronously: may result in update conflicts, solved by the application at read time.
Uses a fully distributed failure detection mechanism (failures are detected by individual nodes when they fail to communicate with others).
An open-source version is available at http://project-voldemort.com/

Case study: large scale distributed file systems (GFS)

History and development of GFS
"The Google File System", a paper published in 2003 by Google Labs at SOSP.
It explains the design and architecture of a distributed system suited to serving very large data files; internally used by Google for storing documents collected from the Web.
Open-source versions were developed soon afterwards: Hadoop File System (HDFS) and Kosmos File System (KFS).

The problem
Why do we need a distributed file system in the first place?
Fact: a standard NFS setup (left part of the figure) does not meet scalability requirements (what if file1 gets really big?).
Right part: GFS/HDFS storage, based on (i) a virtual file namespace, and (ii) partitioning of files into chunks.

Architecture
A master node performs administrative tasks, while servers store chunks and send them to client nodes. The client maintains a cache with chunk locations and communicates directly with the servers.
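
A minimal model of this read path (our own sketch with hypothetical class and method names; the real GFS protocol is more involved): the client asks the master only for chunk locations, caches the answer, and fetches the data directly from a chunk server.

```python
# Simplified, in-memory model of a GFS-style read path (illustration only;
# all APIs here are hypothetical, not Google's actual interfaces).
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks, as in GFS

class ChunkServer:
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read_chunk(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class Master:
    """Holds only metadata: which chunk handles make up a file, and where they live."""
    def __init__(self):
        self.namespace = {}  # (path, chunk index) -> (handle, [server ids])

    def locate(self, path, chunk_index):
        return self.namespace[(path, chunk_index)]

class Client:
    def __init__(self, master, chunk_servers):
        self.master = master
        self.chunk_servers = chunk_servers
        self.location_cache = {}  # avoids asking the master again for the same chunk

    def read(self, path, offset, length):
        chunk_index = offset // CHUNK_SIZE          # which chunk holds this offset
        key = (path, chunk_index)
        if key not in self.location_cache:
            self.location_cache[key] = self.master.locate(path, chunk_index)
        handle, locations = self.location_cache[key]
        server = self.chunk_servers[locations[0]]   # data flows directly from a chunk server
        return server.read_chunk(handle, offset % CHUNK_SIZE, length)

master, s1 = Master(), ChunkServer()
s1.chunks["c1"] = b"hello world"
master.namespace[("/web/file1", 0)] = ("c1", ["s1"])
client = Client(master, {"s1": s1})
print(client.read("/web/file1", 0, 5))  # b'hello'
```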

Technical details
The architecture works best for very large files (e.g., several gigabytes), divided into large (64-128 MB) chunks; this limits the metadata information served by the master.
Each server implements recovery and replication techniques (default: 3 replicas).
(Availability) The master sends heartbeat messages to the servers and initiates a replacement when a failure occurs.
(Scalability) The master is a potential single point of failure; its protection relies on distributed recovery techniques for all changes that affect the file namespace.

Workflow of a write() operation (simplified)
The figure (not reproduced here) shows a non-concurrent append() operation.
In case of concurrent appends to a chunk, the primary replica assigns serial numbers to the mutations and coordinates the secondary replicas.
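
A minimal sketch of the mutation-ordering idea on this slide (our own illustration with hypothetical names; leases, data pipelining and failure handling are omitted): the primary replica assigns a serial number to each mutation and forwards the mutations to the secondaries in that order, so every replica ends up with the same sequence of records.

```python
# Simplified primary/secondary ordering of concurrent appends (illustration only;
# real GFS adds leases, data pipelining, retries, record padding, failure handling).
class Replica:
    def __init__(self):
        self.data = b""

    def apply(self, record: bytes) -> None:
        self.data += record

class PrimaryReplica(Replica):
    def __init__(self, secondaries: list):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def append(self, record: bytes) -> int:
        # One serial number per mutation; the primary applies the mutation and
        # forwards it to the secondaries in that same order.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(record)
        for s in self.secondaries:
            s.apply(record)
        return serial

primary = PrimaryReplica([Replica(), Replica()])
primary.append(b"recordA")  # serial 0
primary.append(b"recordB")  # serial 1
```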

Namespace updates: distributed recovery protocol
An extension of the standard recovery techniques (left part of the figure: centralized; right part: distributed).
If a node fails, the replicated log file can be used to recover the last transactions on one of its mirrors.