CS655: Advanced Topics in Distributed Systems [Fall 2013] Dept. Of Computer Science, Colorado State University


Shrideep Pallickara
Computer Science, Colorado State University

PROFILING HARD DISKS

Characteristics of peripheral devices and their speed relative to the CPU
(the scaled column expresses each time in human terms, 2 billion times slower):

    Item              Time                      Scaled time in human terms
    Processor cycle   0.5 ns (2 GHz)            1 second
    Cache access      1 ns (1 GHz)              2 seconds
    Memory access     15 ns                     30 seconds
    Context switch    5,000 ns (5 μs)           167 minutes
    Disk access       7,000,000 ns (7 ms)       162 days
    Quantum           100,000,000 ns (100 ms)   6.3 years

The mechanical nature of disks limits their performance
- Disk access times have not decreased exponentially, while processor speeds are growing exponentially
- The disparity between processor and disk access times continues to grow: roughly 1:14,000,000

Disk I/O profile for commodity hard disks
- Seek time: 9 ms
- Spin: 7,200-10,000 RPM
- Transfer rate: disk-to-buffer 70 MB/sec; buffer-to-computer 300 MB/sec (SATA)
- Mean time between failures: 600,000 hours
- 1 TB capacity for $90

Centralized file servers = network I/O + disk I/O

Demand pulls
- Process and store large data volumes: 22 EB (2002), 161 EB (2006), 988 EB, roughly 1 ZB (2010)
- Manage concurrency: 1000s of concurrent users
- Reduce latencies
- Shield users from the complexity of the system: uncertainty, failures, etc.
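A quick arithmetic sketch (Python, with numbers taken from the table above, not from the slides) showing where the 1:14,000,000 ratio and the 162-day figure come from:

```python
# Sanity check on the scaled-time table: reproduce the ~1:14,000,000
# processor-to-disk ratio and the "162 days" human-scale disk access time.
processor_cycle_ns = 0.5          # 2 GHz clock
disk_access_ns = 7_000_000        # ~7 ms average disk access

ratio = disk_access_ns / processor_cycle_ns
print(f"Disk access is ~{ratio:,.0f}x slower than one processor cycle")

# The human-scale column multiplies every latency by 2 billion,
# so a 0.5 ns cycle maps onto 1 second.
scale = 2_000_000_000
seconds = disk_access_ns * scale / 1e9     # ns -> s after scaling
print(f"7 ms at human scale: {seconds / 86_400:.0f} days")
```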

Technology pushes
- Falling hardware costs
- Improved reliability of networks
- Increased capacity of individual hard drives and networks

Broad brushstroke themes in current extreme-scale storage systems
- Voluminous data
- Commodity hardware
- Distributed data
- Expect failures
- Tune for access by applications; optimize for dominant usage
- Tradeoff between consistency and availability

Distributed File Systems (DFS) allow sharing of physically dispersed files
- Service activity has to be carried over the network
- Distinctive features: multiplicity, autonomy, and dispersion of clients, servers, and storage devices

Systems we will be looking at
- Andrew File System (AFS)
- Google File System (GFS)
- Amazon's Dynamo

Commonalities
- Think of the system as a meta file system
- The storage model is based on files that are managed by the underlying Unix/Linux-based OS
- Support for replication

Contrast dimensions
- Type of data that is stored
- Design considerations
- Throughput management
- Use of caching
- Exploiting network topologies
- Replication scheme
- Consistency model

TYPES OF DATA TO BE STORED

Data managed by each system
- Andrew File System (AFS): moderate number of small files
- Google File System (GFS): moderate number of very large files
- Amazon Dynamo: large numbers of small files

DESIGN CONSIDERATIONS

AFS design considerations
- Files are small
- Read operations are more common than writes (six times more common)
- Sequential access is common; random access is rare
- Most files are read and written by only one user
- When a file is shared, it is usually only one user who modifies it
- Files are referenced in bursts: if a file has been referenced recently, there is a high probability that it will be referenced again in the near future

GFS design considerations
- Component failures are the norm
- Files are huge by traditional standards
- File mutations are predominantly appends, not overwrites
- Applications and the file system API are designed in lock-step

GFS design considerations (continued)
- Hundreds of producers will concurrently append to a file: many-way merging
- High sustained bandwidth is more important than low latency

Dynamo design considerations
- Completely decentralized system
- Store large numbers of small files; provide a key-value store
- Underlying technology for several core services: best-seller lists, shopping carts, customer preferences, session management, product catalog
- Scale to extreme peak loads, e.g., the holiday shopping period
- No downtimes: 3 million checkouts per day (data from 2007)

System interface
- Store objects with a key, using get() and put()
- get(key) locates the object replicas associated with the key and returns a single object or a list of conflicting versions, along with a context (a minimal sketch of this interface appears below, after the AFS caching notes)

THROUGHPUT CONSIDERATIONS

The key strategy for achieving scalability in AFS is caching files at clients
- Whole-file serving: the entire contents of directories and files are transmitted to client computers by AFS servers (AFS-3: files larger than 64 KB are transferred in 64 KB chunks)
- Whole-file caching: once a file or a chunk has been transferred to a client computer, it is stored in a cache on local disk
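To make the get()/put() interface described above concrete, here is a minimal sketch in Python. The class and type names (DynamoLikeStore, VersionedObject) and the dict-based context are illustrative assumptions, not Amazon's actual API.

```python
# Sketch of a Dynamo-style key-value interface: get() may return several
# conflicting versions; put() carries back the context from an earlier get().
from dataclasses import dataclass

@dataclass
class VersionedObject:
    value: bytes
    context: dict        # opaque version information (e.g., a vector clock)

class DynamoLikeStore:
    def get(self, key: str) -> list[VersionedObject]:
        """Locate the replicas for `key`; may return several conflicting
        versions, each carrying its context."""
        raise NotImplementedError

    def put(self, key: str, context: dict, value: bytes) -> None:
        """Write `value` under `key`, passing back the context obtained
        from an earlier get() so versions can be related causally."""
        raise NotImplementedError
```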

AFS client cache
- Contains several of the most recently used files on that computer
- The cache is permanent: it survives reboots of the computer
- Local copies of files are used to satisfy clients' open requests, in preference to remote copies

AFS caching
- The local cache remains valid for long periods
  - Shared files that are infrequently updated: UNIX commands and libraries
  - Files normally accessed by a single user: most files in a user's home directory and its sub-tree
  - These classes of files account for the overwhelming majority of file accesses
- The local cache is allocated a substantial proportion of disk space: several hundreds of MB

In GFS a file is broken up into fixed-size chunks
- Obvious reason: the file is too big
- Map-Reduce: sets the stage for computations that operate on this data
- Parallel I/O: I/O seek times are roughly 14 x 10^6 times slower than CPU access times

GFS: data flow is decoupled from the control flow
- Utilize each machine's network bandwidth
- Avoid network bottlenecks and high-latency links
- Leverage network topology: estimate distances from IP addresses

GFS client code implements the file system API
- Communications with the master and chunk servers are done transparently, on behalf of apps that read or write data
- Interact with the master for metadata
- Data-bearing communications go directly to chunk servers

Dynamo uses a variant of consistent hashing
- Introduces the notion of virtual nodes: a virtual node looks like a real node
- Each node is responsible for more than one virtual node
- A node is assigned multiple positions in the ring
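Below is a minimal sketch, in Python, of a consistent-hashing ring with virtual nodes in the spirit described above. The hash function (MD5), the Ring class, and the number of virtual nodes per physical node are illustrative assumptions, not Dynamo's actual implementation.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a string onto the ring's key space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node is assigned several positions (virtual nodes) on the ring.
        self._tokens = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [pos for pos, _ in self._tokens]

    def coordinator(self, key: str) -> str:
        # Walk clockwise from hash(key) to the first virtual node; the physical
        # node that owns it coordinates reads and writes for the key.
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._tokens)
        return self._tokens[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.coordinator("shopping-cart:42"))
```

Because each physical node owns many scattered positions, the keys of a failed node are redistributed across many remaining nodes rather than dumped on a single successor, which is the property the virtual-node discussion that follows relies on.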

Advantages of virtual nodes
- If a node becomes unavailable, the load handled by the failed node is dispersed across the remaining virtual nodes
- When the node becomes available again, it accepts roughly the same amount of work from the other nodes
- The number of virtual nodes is decided based on the machine's capacity

REPLICATION SCHEME

Replication: rationale
- Redundancy improves availability, and reliability too
- Improves performance: choose a closer replica for communications (better network pipe, faster interactions)

Replication: basic requirements
- File replicas reside on failure-independent machines: the availability of one replica should not affect another
- The existence of replicas is invisible to higher levels
- Hide the details of the replication scheme: the mapping is handled by the naming schemes

AFS
- Temporary caches serve as replicas
- The primary (responsible for conflict resolution) maintains the legitimate version

GFS: chunk replica creation
- Place replicas on chunk servers with below-average disk space utilization
- Limit the number of recent creations on a chunk server (a predictor of imminent heavy traffic)
- Spread replicas across racks
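A toy sketch (Python) of the placement heuristic just described; the data-model fields, thresholds, and tie-breaking are illustrative assumptions, not GFS's actual policy.

```python
from dataclasses import dataclass

@dataclass
class ChunkServer:
    name: str
    rack: str
    disk_utilization: float   # fraction of disk space in use
    recent_creations: int     # chunk creations in the recent window

def place_replicas(servers, count=3, creation_limit=10):
    """Pick `count` chunk servers for a new chunk's replicas."""
    avg = sum(s.disk_utilization for s in servers) / len(servers)
    # Prefer servers with below-average utilization that are not already
    # absorbing a burst of recent creations (a predictor of heavy traffic).
    candidates = sorted(
        (s for s in servers
         if s.disk_utilization <= avg and s.recent_creations < creation_limit),
        key=lambda s: s.disk_utilization,
    )
    chosen, racks = [], set()
    for s in candidates:
        if s.rack in racks:        # spread replicas across racks
            continue
        chosen.append(s)
        racks.add(s.rack)
        if len(chosen) == count:
            break
    return chosen
```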

GFS: re-replicate chunks when the replication level drops
- Priority depends on how far a chunk is from its replication goal
- Preference for chunks of live files
- Boost the priority of chunks blocking client progress

GFS: rebalancing replicas
- Examine the current replica distribution and move replicas for better disk space usage and load balancing
- Removal of existing replicas: prefer chunk servers with below-average free disk space

GFS: incorporating a new chunk server
- Do not swamp the new server with lots of chunks; the concomitant traffic will bog down the machine
- Gradually fill up the new server with chunks

Dynamo replicates data on multiple hosts
- Each data item is replicated at N hosts
- The coordinator is responsible for the keys that fall within its range
- Additionally, the coordinator replicates each key at its N-1 clockwise successor nodes

What does this mean?
- Each node is responsible for the region of the ring between itself and its Nth predecessor
- The list of nodes responsible for a key is called the preference list
- A node maintains a list of more than N nodes to account for failures
- Account for virtual nodes: make sure the list contains distinct physical nodes
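Continuing the consistent-hashing sketch above, a preference list can be built by walking clockwise from the key's position and collecting successors until N distinct physical nodes are found. This reuses the hypothetical Ring and _hash helpers from the earlier example and is a sketch, not Dynamo's actual code.

```python
import bisect

def preference_list(ring: Ring, key: str, n: int = 3) -> list[str]:
    # Walk clockwise from the key's position on the ring, skipping virtual
    # nodes whose physical node is already in the list, until n distinct
    # physical nodes have been collected.
    start = bisect.bisect_right(ring._positions, _hash(key))
    nodes, seen = [], set()
    for i in range(len(ring._tokens)):
        _, node = ring._tokens[(start + i) % len(ring._tokens)]
        if node not in seen:
            seen.add(node)
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes

print(preference_list(ring, "shopping-cart:42"))   # e.g. ['node-b', 'node-a', 'node-c']
```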

CONSISTENCY SCHEME

AFS consistency
- Based on session semantics: changes are not visible to other clients until the changes are flushed to the server
- Introduces the notion of callbacks, and their renewal

GFS: traditional writes
- The client specifies the offset at which the data needs to be written
- Concurrent writes to the same region are not serializable: the region ends up containing data fragments from multiple clients

GFS: atomic record appends
- The client specifies only the data, not the offset
- GFS appends it to the file at least once atomically, at an offset of GFS's choosing

GFS has a relaxed consistency model
- Consistent: all replicas see the same data
- Defined: consistent AND clients see what the mutation writes in its entirety
- No need for a distributed lock manager

GFS: state of a file region after a mutation

    Mutation            Write                      Record append
    Serial success      defined                    defined interspersed with inconsistent
    Concurrent success  consistent but undefined   defined interspersed with inconsistent
    Failure             inconsistent               inconsistent

GFS consistency implications for applications
- Rely on appends instead of overwrites
- Checkpoint
- Write records that are self-validating and self-identifying (see the sketch below)
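A sketch of what "self-validating, self-identifying" records can look like in practice. The framing format (length + CRC32 header, JSON body carrying a UUID) is an illustrative assumption, not GFS's actual record format.

```python
import json
import uuid
import zlib

def encode_record(payload: dict) -> bytes:
    # Self-identifying: a unique ID; self-validating: a length and checksum.
    body = json.dumps({"id": str(uuid.uuid4()), **payload}).encode()
    header = len(body).to_bytes(4, "big") + zlib.crc32(body).to_bytes(4, "big")
    return header + body

def read_records(blob: bytes):
    """Yield valid, de-duplicated records, skipping padding and garbage
    left behind by failed or retried record appends."""
    seen, offset = set(), 0
    while offset + 8 <= len(blob):
        length = int.from_bytes(blob[offset:offset + 4], "big")
        crc = int.from_bytes(blob[offset + 4:offset + 8], "big")
        body = blob[offset + 8:offset + 8 + length]
        offset += 8 + length
        if length == 0 or len(body) != length or zlib.crc32(body) != crc:
            continue                      # padding or a corrupted region: skip
        record = json.loads(body)
        if record["id"] in seen:          # duplicate from a retried append
            continue
        seen.add(record["id"])
        yield record
```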

Conflict resolution in traditional stores: done during writes
- Read complexity is kept simple
- Writes may be rejected if the data store cannot reach a majority of the replicas at that time

Conflict resolution in Dynamo: when?
- The data store must be always writeable
- Rejecting customer updates? Poor customer experience, and $$$$
- Shopping cart edits must be allowed even during network and server failures
- The complexity of conflict resolution is pushed to reads

Conflict resolution in Dynamo: who?
- The data store? Last write wins for conflicting updates
- The application? It is aware of the data schema and can decide on the most suitable conflict resolution
  - E.g., an application that maintains shopping carts can merge conflicting versions and return a unified cart

Dynamo treats each modification as a new, immutable version of the data
- Multiple versions of the data can be present at the same time
- Often new versions subsume old data: syntactic reconciliation
- When automatic reconciliation is not possible, clients have to do it: collapse the branches into one (e.g., manage your shopping cart)

Dynamo uses vector clocks to capture causality
- A vector clock for each version of the object
- When two versions of an object are compared: if VC_1 <= VC_2 for all indices of the vector clock, then O_1 occurred before O_2; otherwise the changes are in conflict and need reconciliation (see the comparison sketch below)

Quorum-based protocols: when there are N replicas
- To read, assemble a read quorum N_R; to modify a file, assemble a write quorum N_W
- N_R + N_W > N prevents read-write conflicts
- N_W > N/2 prevents write-write conflicts
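A minimal sketch (Python) of the vector-clock comparison described above; representing a clock as a dict from node name to counter is an illustrative assumption.

```python
def descends(vc1: dict, vc2: dict) -> bool:
    """True if the version with clock vc2 causally follows (or equals) vc1."""
    return all(vc2.get(node, 0) >= counter for node, counter in vc1.items())

def in_conflict(vc1: dict, vc2: dict) -> bool:
    # Neither clock dominates the other: concurrent updates, needs reconciliation.
    return not descends(vc1, vc2) and not descends(vc2, vc1)

# Example: two carts updated concurrently through different coordinators.
a = {"node-x": 2, "node-y": 1}
b = {"node-x": 1, "node-y": 2}
print(in_conflict(a, b))   # True -> application-level merge (e.g., union of the carts)
```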

Common configuration of the quorum: N_R = 2, N_W = 2, N = 3
- Both conditions hold: N_R + N_W = 4 > 3, and N_W = 2 > 3/2

INEFFICIENCIES

AFS
- User volumes now would be on the order of 100s of GB; moving user volumes would be prohibitive
- Works well for situations where files are small; the caching mechanism would be seriously impacted with large files

GFS: the master server is a single point of failure
- A master server restart takes several seconds
- Shadow servers exist: they can handle reads of files in place of the master, but not writes
- Requires a massive main memory

GFS: optimized for large files
- But not for a very large number of very small files
- Primary operation on files: long, sequential reads/writes
- A large number of random overwrites will clog things up quite a bit

Consistency issues: GFS expects clients to resolve inconsistencies
- File chunks may have gaps or duplicates of some records; the client has to be able to deal with this
- Imagine doing this for a scientific application where portions of a massive array are corrupted: clients would have to detect this, and detection is possible of course, but onerous!

Dynamo
- The replication scheme does not take network topologies into account
- No support for application-specific consistency resolution: strong for some cases, weak for others

Security model
- None in GFS and Dynamo; operation is expected to be in a trusted environment

POSSIBLE EXTENSIONS

Extensions
- Support for an extremely large number of very small files, e.g., sensor data
- Content within the data being stored is treated as a black box: no support for content-based querying
- No concept of similarity of data items
  - Advantages to storing similar data in close network proximity
  - Useful during query evaluations to avoid performance hotspots
- Push-based notifications

References
- Andrew S. Tanenbaum. Modern Operating Systems. 3rd Edition, 2007. Prentice Hall. ISBN: 0136006639 / 978-0136006633.
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System.
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-value Store.
- The Andrew File System.

QUESTIONS