Distributed Systems, Tutorial 9: Windows Azure Storage

Written by Alex Libov, based on the SOSP 2011 presentation. Winter semester, 2011-2012.

Windows Azure Storage (WAS)
- A scalable cloud storage system
- In production since November 2008
- Used inside Microsoft for applications such as social networking search; serving video, music, and game content; managing medical records; and more
- Thousands of customers outside Microsoft
- Anyone can sign up over the Internet to use the system

WAS Abstractions
- Blobs: file system in the cloud
- Tables: massively scalable structured storage
- Queues: reliable storage and delivery of messages
A common usage pattern: incoming and outgoing data is shipped via Blobs, Queues provide the overall workflow for processing the Blobs, and intermediate service state and final results are kept in Tables or Blobs.

Design Goals
- Highly available with strong consistency: provide access to data in the face of failures and network partitioning
- Durability: replicate data several times within and across data centers
- Scalability
  - Scale to exabytes and beyond
  - Provide a global namespace to access data around the world
  - Automatically load balance data to meet peak traffic demands

Global Partitioned Namespace
http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
- <service> is blob, table, or queue.
- AccountName is the customer-selected account name for accessing storage.
  - The AccountName determines the data center where the data is stored.
  - An application may use multiple AccountNames to store its data across different locations.
- PartitionName locates the data once a request reaches the storage cluster.
  - When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition.
  - The system supports atomic transactions across objects with the same PartitionName value.
- The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account.
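To make the naming scheme concrete, here is a minimal sketch that splits a URL of the form shown above into its parts. The function name `parse_storage_url` and the `StorageName` type are hypothetical, and the sketch follows the URL template literally: it assumes the first path segment is the PartitionName and the rest is the ObjectName, which may differ from how each real service maps names.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class StorageName(NamedTuple):
    account: str
    service: str
    partition: str
    object_name: Optional[str]  # None when the partition name alone identifies the object


def parse_storage_url(url: str) -> StorageName:
    """Split http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # Expect exactly: AccountName, service, "core", "windows", "net".
    if len(labels) != 5 or labels[2:] != ["core", "windows", "net"]:
        raise ValueError(f"not a storage URL: {url!r}")
    account, service = labels[0], labels[1]
    if service not in ("blob", "table", "queue"):
        raise ValueError(f"unknown service: {service!r}")
    path = parts.path.strip("/")
    if not path:
        raise ValueError("missing PartitionName")
    # Assumption: everything after the first "/" is the ObjectName.
    partition, _, object_name = path.partition("/")
    return StorageName(account, service, partition, object_name or None)
```

For example, `parse_storage_url("https://myaccount.blob.core.windows.net/photos/sunrise.jpg")` yields the account `myaccount`, the service `blob`, the partition `photos`, and the object `sunrise.jpg`; the account and object names here are made up for illustration.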

Storage Stamps
- A storage stamp is a cluster of N racks of storage nodes.
- Each rack is built out as a separate fault domain with redundant networking and power.
- Clusters typically range from 10 to 20 racks with 18 disk-heavy storage nodes per rack.
- The first-generation storage stamps hold approximately 2PB of raw storage each.
- The next-generation stamps hold up to 30PB of raw storage each.

High-Level Architecture
Access blob storage via the URL: http://<account>.blob.core.windows.net/
[Figure: data access requests pass through load balancers (LB) into a storage stamp; the Location Service sits above the stamps. Each storage stamp contains a Front-End Layer, a Partition Layer, and a Stream Layer with intra-stamp replication. Storage stamps are linked by inter-stamp (geo) replication.]

Storage Stamp Architecture: Stream Layer
- Append-only distributed file system
- All data from the Partition Layer is stored in files (extents) in the Stream Layer
- An extent is replicated 3 times across different fault and upgrade domains
  - With random selection of where to place replicas
- Checksum all stored data
  - Verified on every client read
- Re-replicate on disk/node/rack failure or checksum mismatch
[Figure: Stream Layer (distributed file system): Paxos-replicated masters (M) and Extent Nodes (EN).]
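The placement rule above (three replicas, random selection, different fault and upgrade domains) can be sketched as follows. This is an illustrative greedy picker under stated assumptions, not the system's actual placement algorithm: the function name `pick_replicas` and the representation of a node as a `(fault_domain, upgrade_domain)` pair are hypothetical.

```python
import random


def pick_replicas(nodes, copies=3, rng=None):
    """Pick `copies` nodes at random so that no two share a fault or upgrade domain.

    nodes: dict mapping node name -> (fault_domain, upgrade_domain).
    Greedy and randomized: it can fail on tight inputs where a valid
    placement exists, which a real placement policy would have to handle.
    """
    rng = rng or random.Random()
    candidates = list(nodes)
    rng.shuffle(candidates)  # random selection of where to place replicas
    chosen, used_fault, used_upgrade = [], set(), set()
    for name in candidates:
        fault, upgrade = nodes[name]
        if fault in used_fault or upgrade in used_upgrade:
            continue  # would share a failure or upgrade event with a chosen replica
        chosen.append(name)
        used_fault.add(fault)
        used_upgrade.add(upgrade)
        if len(chosen) == copies:
            return chosen
    raise RuntimeError("not enough independent domains for placement")
```

Spreading replicas this way means a single rack failure or a rolling upgrade of one domain touches at most one copy of any extent.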

Storage Stamp Architecture: Partition Layer
- Provides transaction semantics and strong consistency for Blobs, Tables, and Queues
- Stores and reads the objects to/from extents in the Stream Layer
- Provides inter-stamp (geo) replication by shipping logs to other stamps
- Scalable object index via partitioning
[Figure: Partition Layer: Partition Master, Lock Service, and Partition Servers.]

Storage Stamp Architecture: Front-End Layer
- Stateless servers
- Authentication and authorization
- Request routing

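Storage Stamp Architecture: Write Path
[Figure: an incoming write request arrives at a Front-End (FE) server in the Front-End Layer and is passed to the Partition Layer (Partition Servers, Partition Master, Lock Service), which writes to the Extent Nodes (EN) of the Stream Layer (masters M, replicated with Paxos). The ack returns through the FE.]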

Partition Layer: Scalable Object Index
- 100s of billions of blobs, entities, and messages across all accounts can be stored in a single stamp
  - Need to efficiently enumerate, query, get, and update them
  - Traffic patterns can be highly dynamic: hot objects, peak load, traffic bursts, etc.
- Need a scalable index for the objects that can
  - Spread the index across 100s of servers
  - Dynamically load balance
  - Dynamically change which servers are serving each part of the index based on load

Scalable Object Index via Partitioning
- The Partition Layer maintains an internal Object Index Table for each data abstraction
  - Blob Index: contains all blob objects for all accounts in a stamp
  - Table Entity Index: contains all table entities for all accounts in a stamp
  - Queue Message Index: contains all messages for all accounts in a stamp
- Scalability is provided for each Object Index
  - Monitor load on each part of the index to determine hot spots
  - The index is dynamically split into thousands of Index RangePartitions based on load
  - Index RangePartitions are automatically load balanced across servers to quickly adapt to changes in load

Partition Layer: Index Range Partitioning
[Figure: the Blob Index is keyed by (Account Name, Container Name, Blob Name) and sorted from "aaaa / aaaa / aaaaa" to "zzzz / zzzz / zzzzz". Example rows: harry / pictures / sunrise; harry / pictures / sunset; richard / videos / soccer; richard / videos / tennis. Within the storage stamp the index is split into three ranges, each served by one Partition Server (PS): A-H by PS1, H-R by PS2, R-Z by PS3. The Partition Master holds this Partition Map (A-H: PS1, H-R: PS2, R-Z: PS3), and the Front-End server holds a copy of it to route requests.]
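The lookup a Front-End server performs against the Partition Map can be sketched with a binary search over the range boundaries. The map contents (A-H: PS1, H-R: PS2, R-Z: PS3) come from the figure; the function name `lookup_partition_server`, the lowercasing of keys, and the choice to treat each lower bound as inclusive are assumptions made for this sketch.

```python
import bisect

# Each entry is (inclusive lower bound of the range, server). Bounds are
# one-element tuples so they compare directly against (account, container, blob).
PARTITION_MAP = [
    (("a",), "PS1"),  # A-H
    (("h",), "PS2"),  # H-R
    (("r",), "PS3"),  # R-Z
]


def lookup_partition_server(account, container, blob, partition_map=PARTITION_MAP):
    """Return the server whose range contains the (account, container, blob) key."""
    key = (account.lower(), container.lower(), blob.lower())
    lower_bounds = [low for low, _ in partition_map]
    # Index of the last range whose lower bound is <= key.
    i = bisect.bisect_right(lower_bounds, key) - 1
    if i < 0:
        raise KeyError(f"no range covers {key!r}")
    return partition_map[i][1]
```

With the example rows from the figure, `harry / pictures / sunrise` falls in H-R and is routed to PS2, while `richard / videos / tennis` falls in R-Z and is routed to PS3. Rebalancing then amounts to editing the map, without moving the keys themselves.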

Partition Layer: RangePartition
- A RangePartition uses a Log-Structured Merge-Tree to maintain its persistent data.
- A RangePartition consists of its own set of streams in the Stream Layer; the streams belong solely to that RangePartition.
- Metadata Stream: the root stream for a RangePartition. The Partition Master (PM) assigns a partition to a Partition Server (PS) by providing the name of the RangePartition's metadata stream.
- Commit Log Stream: a commit log that stores the recent insert, update, and delete operations applied to the RangePartition since its last checkpoint was generated.
- Row Data Stream: stores the checkpoint row data and index for the RangePartition.
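The interplay between the commit log and the checkpoint can be illustrated with a heavily simplified, in-memory sketch. The class `RangePartitionSketch` and its fields are hypothetical: plain Python lists and dicts stand in for the commit log stream and the row data stream, and a single merged checkpoint replaces the multiple checkpoints a real log-structured merge-tree would keep.

```python
class RangePartitionSketch:
    """Toy model: commit log + in-memory table + one checkpoint."""

    _DELETED = object()  # tombstone marking a deleted row

    def __init__(self):
        self.commit_log = []    # stands in for the commit log stream
        self.row_data = {}      # stands in for the checkpoint in the row data stream
        self.memory_table = {}  # changes made since the last checkpoint

    def put(self, key, value):
        self.commit_log.append(("put", key, value))  # log first, then apply
        self.memory_table[key] = value

    def delete(self, key):
        self.commit_log.append(("delete", key, None))
        self.memory_table[key] = self._DELETED

    def get(self, key):
        if key in self.memory_table:  # recent changes shadow the checkpoint
            value = self.memory_table[key]
            return None if value is self._DELETED else value
        return self.row_data.get(key)

    def checkpoint(self):
        """Fold recent changes into the checkpoint and truncate the log."""
        for key, value in self.memory_table.items():
            if value is self._DELETED:
                self.row_data.pop(key, None)
            else:
                self.row_data[key] = value
        self.memory_table = {}
        self.commit_log = []

    def reload(self):
        """Simulate partition load: rebuild the memory table from the log."""
        self.memory_table = {}
        for op, key, value in self.commit_log:
            self.memory_table[key] = value if op == "put" else self._DELETED
```

Because every change is logged before it is applied, a server that takes over the partition only needs the checkpoint plus the operations logged since that checkpoint.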

Stream Layer
- Append-only distributed file system
- Streams are very large files
  - With a file-system-like directory namespace
- Stream operations
  - Open, close, delete streams
  - Rename streams
  - Concatenate streams together
  - Append for writing
  - Random reads

Stream Layer Concepts
- Block
  - Minimum unit of write/read
  - Checksummed
  - Up to N bytes (e.g. 4MB)
- Extent
  - Unit of replication
  - Sequence of blocks
  - Size limit (e.g. 1GB)
  - Sealed or unsealed
- Stream
  - Hierarchical namespace
  - Ordered list of pointers to extents
  - Append/Concatenate
[Figure: stream //foo/myfile.data holds pointers to extents E1, E2, E3, and E4, in order.]
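The three concepts nest naturally as data structures. The sketch below is illustrative only: the class names, the tiny extent size limit, the use of CRC32 as the block checksum, and the policy of sealing an extent when the next block would not fit are all assumptions chosen to keep the example small.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Extent:
    limit: int                                  # size limit in bytes
    blocks: list = field(default_factory=list)  # list of (crc32, data) blocks
    sealed: bool = False

    @property
    def length(self):
        return sum(len(data) for _, data in self.blocks)

    def append(self, data: bytes) -> int:
        """Append one block; return its byte offset within the extent."""
        if self.sealed:
            raise ValueError("extent is sealed")
        offset = self.length
        self.blocks.append((zlib.crc32(data), data))
        return offset

    def read_block(self, index: int) -> bytes:
        crc, data = self.blocks[index]
        if zlib.crc32(data) != crc:  # verify the checksum on every read
            raise IOError("checksum mismatch")
        return data


@dataclass
class Stream:
    name: str
    extent_limit: int = 16
    extents: list = field(default_factory=list)  # ordered pointers to extents

    def append(self, data: bytes):
        """Append to the last extent; return (extent index, offset)."""
        last = self.extents[-1] if self.extents else None
        if last is None or last.sealed or last.length + len(data) > self.extent_limit:
            if last is not None:
                last.sealed = True  # only the last extent of a stream is unsealed
            self.extents.append(Extent(self.extent_limit))
        return len(self.extents) - 1, self.extents[-1].append(data)

    def concatenate(self, other: "Stream"):
        """Concatenation only copies extent pointers, never extent data."""
        if self.extents:
            self.extents[-1].sealed = True
        self.extents.extend(other.extents)
```

The pair returned by `Stream.append` is the kind of (extent, offset) pointer an index in a higher layer could store to find the data again.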

Creating an Extent
[Figure: the Partition Layer asks the Stream Master (SM), one of a Paxos-replicated group of SMs, to create a stream/extent. The SM allocates the extent replica set and replies: EN1 is the primary, EN2 and EN3 are the secondaries (A and B).]

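Replication Flow
[Figure: the Partition Layer sends an append to the primary (EN1); the primary forwards it to the secondaries (EN2 as Secondary A, EN3 as Secondary B) and sends the ack back to the Partition Layer. The replica set is EN1 primary, EN2 and EN3 secondary.]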

Providing Bit-wise Identical Replicas
- Want all replicas for an extent to be bit-wise the same, up to a committed length
  - Want to store pointers from the partition layer index to an extent+offset
  - Want to be able to read from any replica
- Replication flow
  - All appends to an extent go to the primary
  - The primary orders all incoming appends and picks the offset for each append in the extent
  - The primary then forwards the offset and data to the secondaries
  - The primary performs in-order acks back to clients for extent appends
  - The primary returns the offset of the append in the extent
  - An extent offset can commit back to the client once all replicas have written that offset and all prior offsets have also been completely written
  - This represents the committed length of the extent
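The replication flow can be sketched with a primary that picks the offset, writes locally, and forwards the same offset and data to each secondary. This is a synchronous, single-threaded simplification: the class names are hypothetical, there is no networking or failure handling, and in-order acks fall out trivially because appends are processed one at a time.

```python
class Replica:
    """One copy of an extent; accepts writes only at its current end."""

    def __init__(self, name):
        self.name = name
        self.data = bytearray()

    def write_at(self, offset, payload):
        if offset != len(self.data):
            raise ValueError(f"{self.name}: out-of-order write at {offset}")
        self.data += payload


class PrimaryExtent:
    def __init__(self, secondaries):
        self.local = Replica("primary")
        self.secondaries = secondaries
        self.committed_length = 0

    def append(self, payload: bytes) -> int:
        offset = len(self.local.data)  # the primary picks the offset
        self.local.write_at(offset, payload)
        for secondary in self.secondaries:  # forward offset and data
            secondary.write_at(offset, payload)
        # Every replica now holds this offset and all prior ones,
        # so the append can be acked and the committed length advanced.
        self.committed_length = offset + len(payload)
        return offset
```

Since every replica writes the same bytes at the same offset, an extent+offset pointer returned by `append` is valid on any replica up to `committed_length`.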

Dealing with Write Failures
Failure during append:
1. The ack from the primary is lost on its way back to the partition layer
   - A retry from the partition layer can cause multiple blocks to be appended (duplicate records)
2. Unresponsive/unreachable Extent Node (EN)
   - The append will not be acked back to the partition layer
   - Seal the failed extent
   - Allocate a new extent and append immediately
[Figure: stream //foo/myfile.dat with pointers to extents E1 through E5; E5 is the newly allocated extent.]
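The second failure case (seal the failed extent, allocate a new one, append immediately) can be sketched as a small failover wrapper. All names here (`FakeExtent`, `ExtentUnavailable`, `append_with_failover`) are hypothetical test doubles, and the sketch does not attempt to detect the duplicate records that the first failure case can produce.

```python
class ExtentUnavailable(Exception):
    """Raised when an append is not acked by the extent's replicas."""


class FakeExtent:
    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail  # simulate an unreachable extent node
        self.sealed = False
        self.records = []

    def append(self, record):
        if self.fail:
            raise ExtentUnavailable(self.name)
        self.records.append(record)
        return len(self.records) - 1


def append_with_failover(stream, record, allocate_extent):
    """Append to the stream's last extent; on failure seal it and move on.

    stream: ordered list of extents, of which only the last is unsealed.
    allocate_extent: callable returning a fresh extent.
    Returns (extent name, position of the record in that extent).
    """
    current = stream[-1]
    try:
        return current.name, current.append(record)
    except ExtentUnavailable:
        current.sealed = True           # seal the failed extent
        replacement = allocate_extent()  # allocate a new extent
        stream.append(replacement)
        return replacement.name, replacement.append(record)  # append immediately
```

The writer never waits for the failed node to recover: it pays the cost of one new extent and continues, which keeps append latency bounded.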

Extent Sealing (Scenario 1)
[Figure, step 1: after a failed append, the Partition Layer asks the Stream Master (SM) to seal the extent. The SM asks the extent nodes for their current length; the replicas it reaches report 120, and the extent is sealed at 120.]

[Figure, step 2: an extent node that missed the seal later syncs with the SM and adopts the sealed length of 120. Nodes shown: EN1 (Primary), EN2 (Secondary A), EN3 (Secondary B), EN4.]

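Extent Sealing (Scenario 2)
[Figure, step 1: after a failed append, the Partition Layer asks the Stream Master (SM) to seal the extent. The SM asks the extent nodes for their current length; one replica holds the append at 120 but the SM obtains a length of 100, and the extent is sealed at 100.]

[Figure, step 2: an extent node that missed the seal later syncs with the SM and adopts the sealed length of 100. Nodes shown: EN1 (Primary), EN2 (Secondary A), EN3 (Secondary B), EN4.]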
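Both scenarios are consistent with a simple rule: seal at the smallest length reported by the replicas the SM can reach. The sketch below encodes that rule; the function name and the mapping of lengths to particular extent nodes are assumptions, since the flattened figures do not show which node reported which value.

```python
def choose_seal_length(reported_lengths):
    """Pick the length at which to seal an extent.

    reported_lengths: dict mapping extent node name -> current length,
    or None for a node the Stream Master could not reach.
    """
    reachable = [n for n in reported_lengths.values() if n is not None]
    if not reachable:
        raise RuntimeError("no replica reachable; cannot seal")
    # The smallest reachable length is present on every reachable replica,
    # and nothing beyond it was acked, since an ack needs all replicas.
    return min(reachable)


# Scenario 1: every reachable replica reports 120.
scenario_1 = choose_seal_length({"EN1": None, "EN2": 120, "EN3": 120})

# Scenario 2: reachable replicas disagree, so the shorter length wins.
scenario_2 = choose_seal_length({"EN1": None, "EN2": 120, "EN3": 100})
```

Under this rule `scenario_1` is 120 and `scenario_2` is 100, matching the sealed lengths in the two figures. A replica that held extra bytes past the sealed length never had them acked to a client, so discarding them when it syncs loses no committed data.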

Providing Consistency for Data Streams
- Data streams: the Row and Blob Data Streams
- For data streams, the Partition Layer only reads from offsets returned by successful appends
  - Committed on all replicas
  - Offset valid on any replica
[Figure: a network partition in which the Partition Server (PS) can talk to EN3 but the SM cannot. It is still safe for the PS to read from EN3. Nodes shown: EN1 (Primary), EN2 (Secondary A), EN3 (Secondary B).]

Providing Consistency for Log Streams
- Log streams: the Commit and Metadata log streams
- Logs are used on partition load
- Check the commit length first
- Only read from
  - An unsealed replica if all replicas have the same commit length
  - A sealed replica
[Figure: the same network partition (the PS can talk to EN3, the SM cannot). The commit length is checked on the replicas and the SM seals the extent; the Partition Server uses EN1 and EN2 for loading. Nodes shown: EN1 (Primary), EN2 (Secondary A), EN3 (Secondary B).]
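The read rule for log streams can be written as a small decision function. The `ReplicaState` type and the function name `readable_replicas` are hypothetical, and the sketch assumes that a replica whose commit length could not be checked is reported as `None`.

```python
from typing import List, NamedTuple, Optional


class ReplicaState(NamedTuple):
    name: str
    sealed: bool
    commit_length: Optional[int]  # None if the length could not be checked


def readable_replicas(replicas: List[ReplicaState]) -> List[str]:
    """Return the replicas a log stream may be loaded from.

    Rule from the slide: read from a sealed replica, or from an unsealed
    replica only if all replicas have the same commit length.
    """
    sealed = [r.name for r in replicas if r.sealed]
    if sealed:
        return sealed
    lengths = {r.commit_length for r in replicas}
    if None not in lengths and len(lengths) == 1:
        return [r.name for r in replicas]
    return []  # not safe yet; the extent would have to be sealed first
```

In the pictured partition, EN1 and EN2 are sealed while EN3 is not, so the function returns only EN1 and EN2. This differs from data streams, where any replica is safe because the reader already holds an offset from a successful append.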

Summary
- Highly available cloud storage with strong consistency
- Scalable data abstractions to build your applications
  - Blobs: files and large objects
  - Tables: massively scalable structured storage
  - Queues: reliable delivery of messages
More information at: http://www.sigops.org/sosp/sosp11/current/2011-Cascais/11-calder-online.pdf