Distributed Systems Tutorial 9: Windows Azure Storage


Distributed Systems Tutorial 9: Windows Azure Storage. Written by Alex Libov, based on the SOSP 2011 presentation. Winter semester, 2011-2012.

Windows Azure Storage (WAS)
- A scalable cloud storage system, in production since November 2008
- Used inside Microsoft for applications such as social networking search, serving video, music and game content, managing medical records, and more
- Thousands of customers outside Microsoft; anyone can sign up over the Internet to use the system

WAS Abstractions
- Blobs: file system in the cloud
- Tables: massively scalable structured storage
- Queues: reliable storage and delivery of messages
A common usage pattern is incoming and outgoing data being shipped via Blobs, Queues providing the overall workflow for processing the Blobs, and intermediate service state and final results being kept in Tables or Blobs (a minimal sketch of this pattern follows below).
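
As a rough illustration of that usage pattern, the sketch below wires the three abstractions together with hypothetical in-memory stand-ins (the classes BlobStore, QueueClient, and TableStore and their methods are illustrative, not the real Azure Storage client library): a blob carries the bulk data, a queue message drives the workflow, and a table records intermediate state and results.

```python
# Minimal sketch of the Blob/Queue/Table usage pattern described above.
# BlobStore, QueueClient and TableStore are hypothetical in-memory stand-ins,
# not the real Windows Azure Storage client library.

class BlobStore:
    def __init__(self):
        self._blobs = {}
    def put(self, name, data):
        self._blobs[name] = data
    def get(self, name):
        return self._blobs[name]

class QueueClient:
    def __init__(self):
        self._messages = []
    def enqueue(self, msg):
        self._messages.append(msg)
    def dequeue(self):
        return self._messages.pop(0) if self._messages else None

class TableStore:
    def __init__(self):
        self._rows = {}
    def upsert(self, partition_key, row_key, entity):
        self._rows[(partition_key, row_key)] = entity

blobs, queue, table = BlobStore(), QueueClient(), TableStore()

# 1. Incoming data is shipped in via a blob ...
blobs.put("incoming/video-0001", b"...raw bytes...")
# 2. ... and a queue message tells workers which blob to process.
queue.enqueue({"blob": "incoming/video-0001", "job": "transcode"})

# A worker drains the queue, processes the blob, and records state in a table.
msg = queue.dequeue()
if msg is not None:
    data = blobs.get(msg["blob"])
    result = data[::-1]                       # stand-in for real processing
    blobs.put("outgoing/video-0001", result)  # final result back into a blob
    table.upsert("video-0001", "status", {"state": "done", "size": len(result)})
```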

Design goals
- Highly available with strong consistency: provide access to data in the face of failures/partitioning
- Durability: replicate data several times within and across data centers
- Scalability: need to scale to exabytes and beyond; provide a global namespace to access data around the world; automatically load balance data to meet peak traffic demands

Global Partitioned Namespace
http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
- <service> can be blob, table, or queue.
- AccountName is the customer-selected account name for accessing storage. The account name specifies the data center where the data is stored. An application may use multiple AccountNames to store its data across different locations.
- PartitionName locates the data once a request reaches the storage cluster.
- When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition. The system supports atomic transactions across objects with the same PartitionName value. The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account.
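
To make the namespace concrete, here is a small sketch that builds and parses URLs of this form. The helper functions build_url and parse_url are my own illustration (not part of WAS or any SDK); the field names follow the AccountName / PartitionName / ObjectName terminology above.

```python
from urllib.parse import urlparse

# Sketch only: build and parse URLs of the form
#   http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
# The ObjectName is optional, matching the description above.

def build_url(account, service, partition_name, object_name=None, https=True):
    assert service in ("blob", "table", "queue")
    scheme = "https" if https else "http"
    path = partition_name if object_name is None else f"{partition_name}/{object_name}"
    return f"{scheme}://{account}.{service}.core.windows.net/{path}"

def parse_url(url):
    parsed = urlparse(url)
    account, service, _ = parsed.netloc.split(".", 2)  # account.<service>.core.windows.net
    parts = parsed.path.lstrip("/").split("/", 1)
    partition_name = parts[0]
    object_name = parts[1] if len(parts) > 1 else None
    return account, service, partition_name, object_name

print(build_url("myaccount", "blob", "pictures", "sunrise.jpg"))
# -> https://myaccount.blob.core.windows.net/pictures/sunrise.jpg
print(parse_url("https://myaccount.queue.core.windows.net/orders"))
# -> ('myaccount', 'queue', 'orders', None)
```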

Storage Stamps
- A storage stamp is a cluster of N racks of storage nodes. Each rack is built out as a separate fault domain with redundant networking and power.
- Clusters typically range from 10 to 20 racks with 18 disk-heavy storage nodes per rack.
- The first generation of storage stamps holds approximately 2 PB of raw storage each. The next generation of stamps holds up to 30 PB of raw storage each.

High Level Architecture
Access blob storage via the URL: http://<account>.blob.core.windows.net/
[Diagram: a Storage Location Service directs data access, through load balancers (LB), to storage stamps. Each storage stamp consists of a Front-End layer, a Partition layer, and a Stream layer, with intra-stamp replication inside a stamp and inter-stamp (geo) replication across stamps.]

Storage Stamp Architecture: Stream Layer
- Append-only distributed file system
- All data from the Partition layer is stored into files (extents) in the Stream layer
- An extent is replicated 3 times across different fault and upgrade domains, with random selection for where to place replicas
- Checksum all stored data; verified on every client read
- Re-replicate on disk/node/rack failure or checksum mismatch
[Diagram: the Stream layer (distributed file system) with a Paxos-replicated Stream Manager and Extent Nodes (EN).]
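
As a toy illustration of "checksum all stored data, verified on every client read", the sketch below keeps a CRC alongside each block and flags a mismatch on read so a caller could fall back to another replica and trigger re-replication. The class ExtentReplica and its methods are assumptions for illustration, not the actual extent node implementation.

```python
import zlib

# Toy illustration of per-block checksums verified on every read.
# Real extent nodes store blocks durably on disk; this is an in-memory sketch.

class ChecksumMismatch(Exception):
    pass

class ExtentReplica:
    def __init__(self):
        self._blocks = []          # list of (crc32, bytes)

    def append_block(self, data: bytes) -> int:
        self._blocks.append((zlib.crc32(data), data))
        return len(self._blocks) - 1      # block index within the extent

    def read_block(self, index: int) -> bytes:
        crc, data = self._blocks[index]
        if zlib.crc32(data) != crc:
            # In WAS a mismatch means the read is served from another replica
            # and the corrupt replica is re-replicated.
            raise ChecksumMismatch(f"block {index} failed checksum verification")
        return data

replica = ExtentReplica()
i = replica.append_block(b"hello stream layer")
assert replica.read_block(i) == b"hello stream layer"
```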

Storage Stamp Architecture: Partition Layer
- Provides transaction semantics and strong consistency for Blobs, Tables, and Queues
- Stores and reads the objects to/from extents in the Stream layer
- Provides inter-stamp (geo) replication by shipping logs to other stamps
- Scalable object index via partitioning
[Diagram: the Partition layer with a Partition Master, a Lock Service, and Partition Servers.]

Storage Stamp Architecture: Front-End Layer
- Stateless servers
- Authentication + authorization
- Request routing

Storage Stamp Architecture: Front-End Layer
[Diagram: an incoming write request arrives at a Front-End (FE) server, which routes it to a Partition Server in the Partition layer (managed by the Partition Master and Lock Service); the Partition Server writes to Extent Nodes (EN) in the Stream layer (coordinated by the Paxos-replicated Stream Manager), and the ack flows back through the FE.]

Partition Layer: Scalable Object Index
- 100s of billions of blobs, entities, and messages across all accounts can be stored in a single stamp
- Need to efficiently enumerate, query, get, and update them
- Traffic pattern can be highly dynamic: hot objects, peak load, traffic bursts, etc.
- Need a scalable index for the objects that can spread the index across 100s of servers, dynamically load balance, and dynamically change which servers serve each part of the index based on load

Scalable Object Index via Partitioning
The Partition layer maintains an internal Object Index Table for each data abstraction:
- Blob Index: contains all blob objects for all accounts in a stamp
- Table Entity Index: contains all table entities for all accounts in a stamp
- Queue Message Index: contains all messages for all accounts in a stamp
Scalability is provided for each Object Index:
- Monitor the load to each part of the index to determine hot spots
- The index is dynamically split into thousands of Index RangePartitions based on load
- Index RangePartitions are automatically load balanced across partition servers to quickly adapt to changes in load

Partition Layer: Index Range Partitioning
[Diagram: the Blob Index is keyed by (Account Name, Container Name, Blob Name), with example rows running from (aaaa, aaaa, aaaaa) through (harry, pictures, sunrise), (harry, pictures, sunset), (richard, videos, soccer), (richard, videos, tennis), up to (zzzz, zzzz, zzzzz). The index is split into RangePartitions, e.g. A-H served by PS1, H'-R by PS2, and R'-Z by PS3. The Partition Master maintains this Partition Map, and Front-End servers use it to route each request to the partition server serving the corresponding key range.]
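
A minimal sketch of the kind of range lookup a Front-End does against its Partition Map. The PartitionMap class, the flat string form of the index key, and the exact range boundaries are illustrative assumptions; only the idea of mapping key ranges to partition servers comes from the slide.

```python
import bisect

# Sketch of a Partition Map: key ranges (represented by their upper bound)
# mapped to the partition server currently serving that RangePartition.

class PartitionMap:
    def __init__(self, ranges):
        # ranges: list of (upper_bound_key, server), sorted by key
        self._uppers = [u for u, _ in ranges]
        self._servers = [s for _, s in ranges]

    def server_for(self, key: str) -> str:
        i = bisect.bisect_left(self._uppers, key)
        if i == len(self._uppers):
            raise KeyError(f"no RangePartition covers key {key!r}")
        return self._servers[i]

# Illustrative boundaries roughly matching A-H -> PS1, H'-R -> PS2, R'-Z -> PS3.
pmap = PartitionMap([("i", "PS1"), ("s", "PS2"), ("zzzzzzzz", "PS3")])

# The Blob Index key is (AccountName, ContainerName, BlobName); a flat string
# form of the key is enough for this illustration.
print(pmap.server_for("harry/pictures/sunrise"))   # -> PS1
print(pmap.server_for("richard/videos/soccer"))    # -> PS2
print(pmap.server_for("zzzz/zzzz/zzzza"))          # -> PS3
```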

Partition Layer: RangePartition
A RangePartition uses a Log-Structured Merge-Tree to maintain its persistent data. Each RangePartition consists of its own set of streams in the Stream layer, and the streams belong solely to that RangePartition:
- Metadata Stream: the root stream for a RangePartition. The Partition Manager (PM) assigns a partition to a Partition Server (PS) by providing the name of the RangePartition's metadata stream.
- Commit Log Stream: a commit log used to store the recent insert, update, and delete operations applied to the RangePartition since the last checkpoint was generated for the RangePartition.
- Row Data Stream: stores the checkpoint row data and index for the RangePartition.
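
The sketch below is a deliberately simplified, in-memory take on this LSM-style layout (the class RangePartitionSketch and its methods are my own names, and the streams are faked with Python containers): writes go to a commit log and an in-memory table, and a checkpoint folds the memory table into the row data so the log can be truncated.

```python
# Simplified sketch of the LSM-style persistence of a RangePartition:
# recent mutations go to a commit log plus an in-memory table; a checkpoint
# writes the memory table out as row data so the commit log can be truncated.

class RangePartitionSketch:
    def __init__(self):
        self.commit_log = []        # stands in for the commit log stream
        self.row_data = {}          # stands in for checkpointed row data
        self.memory_table = {}      # recent, not-yet-checkpointed rows

    def insert_or_update(self, key, value):
        self.commit_log.append(("put", key, value))   # record intent first
        self.memory_table[key] = value

    def delete(self, key):
        self.commit_log.append(("del", key, None))
        self.memory_table[key] = None                 # tombstone

    def get(self, key):
        # Newest data wins: check the memory table before checkpointed rows.
        if key in self.memory_table:
            return self.memory_table[key]
        return self.row_data.get(key)

    def checkpoint(self):
        # Merge recent rows into row data, then truncate the commit log.
        for key, value in self.memory_table.items():
            if value is None:
                self.row_data.pop(key, None)
            else:
                self.row_data[key] = value
        self.memory_table.clear()
        self.commit_log.clear()

rp = RangePartitionSketch()
rp.insert_or_update("harry/pictures/sunrise", {"size": 1024})
rp.checkpoint()
rp.delete("harry/pictures/sunrise")
assert rp.get("harry/pictures/sunrise") is None
```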

Stream Layer: Append-Only Distributed File System
- Streams are very large files
- Has a file-system-like directory namespace
Stream operations:
- Open, close, delete streams
- Rename streams
- Concatenate streams together
- Append for writing
- Random reads

Stream Layer Concepts
- Block: minimum unit of write/read; checksummed; up to N bytes (e.g. 4 MB)
- Extent: unit of replication; a sequence of blocks; size limit (e.g. 1 GB); sealed/unsealed
- Stream: hierarchical namespace; an ordered list of pointers to extents; append/concatenate
[Diagram: Stream //foo/myfile.data is an ordered list of pointers to Extents E1, E2, E3, E4.]
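
To tie the three concepts together, here is a small data-model sketch. The class names, methods, and size constants are illustrative assumptions under the limits quoted above: blocks are checksummed units of read/write, extents are append-only sequences of blocks that can be sealed, and a stream is an ordered list of extent pointers that supports append and concatenate.

```python
import zlib

BLOCK_LIMIT = 4 * 1024 * 1024          # e.g. 4 MB per block
EXTENT_LIMIT = 1 * 1024 * 1024 * 1024  # e.g. 1 GB per extent

class Block:
    """Minimum unit of read/write, checksummed."""
    def __init__(self, data: bytes):
        assert len(data) <= BLOCK_LIMIT
        self.data = data
        self.crc = zlib.crc32(data)

class Extent:
    """Unit of replication: an append-only sequence of blocks, sealable."""
    def __init__(self, name: str):
        self.name = name
        self.blocks = []
        self.sealed = False

    def size(self) -> int:
        return sum(len(b.data) for b in self.blocks)

    def append(self, data: bytes) -> None:
        if self.sealed or self.size() + len(data) > EXTENT_LIMIT:
            raise RuntimeError("extent is sealed or full")
        self.blocks.append(Block(data))

class Stream:
    """Hierarchical name plus an ordered list of pointers to extents."""
    def __init__(self, name: str):
        self.name = name               # e.g. "//foo/myfile.data"
        self.extents = []              # ordered extent pointers

    def last_extent(self) -> Extent:
        return self.extents[-1]

    def concatenate(self, other: "Stream") -> None:
        # Concatenation splices the other stream's extent pointers on the end.
        self.extents.extend(other.extents)

s = Stream("//foo/myfile.data")
s.extents.append(Extent("E1"))
s.last_extent().append(b"first block")
```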

Creating an Extent
[Diagram: the Partition layer sends a create stream/extent request to the Paxos-replicated Stream Manager (SM). The SM allocates the extent replica set across extent nodes, designating EN1 as primary and EN2, EN3 as secondaries (Secondary A and Secondary B), and returns the assignment to the Partition layer.]

Replication Flow
[Diagram: the Partition layer sends appends to the primary EN1; the primary replicates them to the secondaries EN2 and EN3 (Secondary A and Secondary B), and the ack is returned to the Partition layer.]

Providing Bit-wise Identical Replicas
Goals:
- All replicas for an extent are bit-wise the same, up to a committed length
- Store pointers from the partition layer index to an extent+offset
- Be able to read from any replica
Replication flow (see the sketch below):
- All appends to an extent go to the primary
- The primary orders all incoming appends and picks the offset for the append in the extent
- The primary then forwards the offset and data to the secondaries
- The primary performs in-order acks back to clients for extent appends, returning the offset of the append in the extent
- An extent offset can commit back to the client once all replicas have written that offset and all prior offsets have also already been completely written; this represents the committed length of the extent
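
The sketch below mimics this replication flow under simplified assumptions (synchronous calls, in-memory "extent nodes", class names of my choosing): the primary picks the offset, forwards offset+data to the secondaries, and an offset only counts toward the committed length once it, and everything before it, is on all replicas.

```python
# Sketch of the append/commit flow: the primary orders appends, picks the
# offset, forwards (offset, data) to the secondaries, and an append commits
# only when all replicas have written that offset and every prior offset.

class ReplicaSketch:
    def __init__(self, name):
        self.name = name
        self.data = bytearray()

    def write_at(self, offset: int, payload: bytes) -> None:
        assert offset == len(self.data), "appends must arrive in order"
        self.data.extend(payload)

class PrimarySketch:
    def __init__(self, secondaries):
        self.local = ReplicaSketch("primary")
        self.secondaries = secondaries

    def append(self, payload: bytes) -> int:
        offset = len(self.local.data)        # primary picks the offset
        self.local.write_at(offset, payload)
        for sec in self.secondaries:         # forward offset + data
            sec.write_at(offset, payload)
        # All replicas (and all prior offsets) are written, so this offset is
        # now within the committed length and can be acked to the client.
        return offset

    def committed_length(self) -> int:
        return min(len(r.data) for r in [self.local] + self.secondaries)

primary = PrimarySketch([ReplicaSketch("A"), ReplicaSketch("B")])
off = primary.append(b"record-1")
print(off, primary.committed_length())   # 0 8
```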

Dealing with Write Failures
Failure during append:
1. Ack from the primary lost on the way back to the partition layer: a retry from the partition layer can cause multiple blocks to be appended (duplicate records)
2. Unresponsive/unreachable Extent Node (EN): the append will not be acked back to the partition layer; seal the failed extent, allocate a new extent, and append immediately
[Diagram: Stream //foo/myfile.dat now points to Extents E1-E4 plus a newly allocated Extent E5.]

Extent Sealing (Scenario 1)
[Diagram: the Partition layer asks the Paxos-replicated Stream Manager (SM) to seal the extent. The SM asks the extent nodes for their current length; the replicas it can reach report 120, so the extent is sealed at 120.]

Extent Sealing (Scenario 1, continued)
[Diagram: a replica that was unreachable during sealing later syncs with the SM and learns the sealed commit length of 120.]

Extent Sealing (Scenario 2)
[Diagram: an append has reached offset 120 only on a replica that the SM cannot reach when the seal is requested; the reachable replicas report 100, so the SM seals the extent at 100.]

Extent Sealing (Scenario 2, continued)
[Diagram: the previously unreachable replica later syncs with the SM, learns the sealed commit length of 100, and reconciles its replica to that length.]
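
A compact sketch of the sealing decision shown in the two scenarios, under the assumption (taken from the slides) that the Stream Manager asks the replicas it can reach for their current length and seals at the smallest length reported; the function name is illustrative.

```python
# Sketch of the seal decision: the Stream Manager asks the reachable extent
# nodes for their current length and seals the extent at the smallest one.

def seal_extent(reported_lengths):
    """reported_lengths: lengths from the reachable replicas only."""
    if not reported_lengths:
        raise RuntimeError("cannot seal: no replica reachable")
    return min(reported_lengths)

# Scenario 1: the reachable replicas both report 120 -> sealed at 120.
print(seal_extent([120, 120]))   # 120

# Scenario 2: the last append (to 120) only reached the now-unreachable node;
# the reachable replicas report 100 -> sealed at 100.
print(seal_extent([100, 100]))   # 100
```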

Providing Consistency for Data Streams
For data streams (the row and blob data streams), the Partition layer only reads from offsets returned from successful appends, which are committed on all replicas, so the offset is valid on any replica.
[Diagram: under a network partition where the Partition Server (PS) can talk to EN3 but the SM cannot, it is still safe for the PS to read from EN3.]

Providing Consistency for Log Streams
Logs (the commit and metadata log streams) are read on partition load:
- Check the commit length first
- Only read from an unsealed replica if all the replicas have the same commit length, or from a sealed replica
[Diagram: under a network partition where the Partition Server can talk to EN3 but the SM cannot, the SM seals the extent using EN1 and EN2; the Partition Server checks the commit lengths and uses EN1 and EN2 for loading.]
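
A small sketch of that load-time read rule (the Replica record and helper function are illustrative names, not WAS data structures): an unsealed replica may only be used when every replica reports the same commit length; otherwise only sealed replicas are safe to load from.

```python
# Sketch of the load-time read rule for commit/metadata log streams:
# an unsealed replica may only be used if every replica reports the same
# commit length; otherwise only sealed replicas are safe to read from.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    commit_length: int
    sealed: bool

def replicas_safe_to_load_from(replicas):
    lengths = {r.commit_length for r in replicas}
    if len(lengths) == 1:
        return list(replicas)                 # all agree: any replica is fine
    return [r for r in replicas if r.sealed]  # disagree: sealed replicas only

ens = [Replica("EN1", 100, sealed=True),
       Replica("EN2", 100, sealed=True),
       Replica("EN3", 120, sealed=False)]     # partitioned away from the SM

print([r.name for r in replicas_safe_to_load_from(ens)])   # ['EN1', 'EN2']
```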

Summary
Highly available cloud storage with strong consistency. Scalable data abstractions to build your applications:
- Blobs: files and large objects
- Tables: massively scalable structured storage
- Queues: reliable delivery of messages
More information at: http://www.sigops.org/sosp/sosp11/current/2011-Cascais/11-calder-online.pdf