Handling heterogeneous storage devices in clusters


Handling heterogeneous storage devices in clusters. André Brinkmann, University of Paderborn; Toni Cortes, Barcelona Supercomputing Center

Randomized Data Placement Schemes
- Introduction: Randomization, Balls into bins
- Distributed Hash Tables: Consistent Hashing and Share
- Redundancy and Randomized Data Placement Schemes
- Distributed Metadata Management

Introduction: Randomization
- Deterministic data placement schemes suffered from many drawbacks for a long time:
  - Heterogeneity has been an issue
  - It has been costly to adapt to new storage systems
  - It is difficult to support storage-on-demand concepts
- Is there an alternative to deterministic schemes?
- Yes, randomization can help to overcome these drawbacks, but new challenges are introduced!

Balls into bins Games I
- Basic task of balls into bins games: assign a set of m balls to n bins
- Motivation:
  - Bins = hard disks
  - Balls = data items
  - L = max number of data items on each disk
- Where should I place the next item?

Balls into bins Games II
- Basic result: assign n balls to n bins, choosing one bin for every ball independently and uniformly at random
- The maximum load is sharply concentrated at (1 + o(1)) · ln n / ln ln n w.h.p., where w.h.p. abbreviates "with probability at least 1 - n^(-α), for any fixed α > 0"

Balls into bins Games III
- This sounds terrible: the maximum loaded hard disk stores about ln n / ln ln n times more data than the average. This seems not to be scalable, or ...?
- But the model assumes that only very few data items are stored inside the environment, while each disk is able to store many objects. Let's assume that "many objects" means m >> n · (ln n)^3. Then it holds w.h.p. that the maximum load is only m/n + O(sqrt(m · ln n / n)), i.e., the relative imbalance vanishes.
See, e.g., M. Raab, A. Steger: Balls into Bins - A Simple and Tight Analysis
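The balls-into-bins bounds above are easy to check empirically. A minimal simulation (plain Python, all names our own) contrasts the sparse case m = n, where the fullest bin clearly overshoots the average, with the dense case m >> n, where the relative imbalance almost disappears:

```python
import random

def max_load(m: int, n: int, seed: int = 0) -> int:
    """Throw m balls into n bins independently and uniformly at random;
    return the load of the fullest bin."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(m):
        loads[rng.randrange(n)] += 1
    return max(loads)

n = 10_000

# Sparse case (m = n): the fullest bin holds roughly ln n / ln ln n
# balls, far above the average load of 1.
sparse = max_load(n, n)

# Dense case (m >> n): the deviation from the average m/n grows only
# like sqrt(m ln n / n), so the placement is almost perfectly balanced.
m = 100 * n
dense = max_load(m, n)

print(f"m = n:     max load {sparse}, average 1")
print(f"m = 100n:  max load {dense}, average {m // n}")
```

Running this shows the qualitative gap the slides describe: the sparse maximum is several times the average, while the dense maximum is within a few tens of percent of it.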

Distributed Hash Tables
- Randomization introduces some (well known) challenges
- Key questions are:
  - How can we retrieve a stored data item?
  - How can we adapt to a changing number of disks?
  - How can we handle heterogeneity?
  - How can we support redundancy?
- These are the key tasks of Distributed Hash Tables (DHTs)

Consistent Hashing I
- Introduced in the context of Web Caching
- Bins are mapped by a pseudo-random hash function h: Bins → [0,1) onto a ring of length 1
- Bins become responsible for their interval
- Balls are mapped by an additional hash function g: Balls → [0,1) onto the ring
- Each bin stores the balls in its interval
See D. Karger, E. Lehman et al.: Consistent Hashing and Random Trees: Tools for Relieving Hot Spots on the World Wide Web

Consistent Hashing II
- The average load of each bin is m/n, but the deviation from the average can be high: the maximum arc length on the ring becomes Θ(log n / n) w.h.p.
- Solution: each bin is mapped by a set of independent hash functions to multiple points on the ring. The maximum arc length assigned to a bin can be reduced to (1 + ε)/n for an arbitrarily small constant ε > 0 if Θ(log n) virtual bins are used for each physical bin.
See I. Stoica, R. Morris, et al.: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications
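The two-hash-function scheme with virtual bins can be sketched in a few lines. This is a toy model rather than the Chord implementation: a single SHA-256-based function `_h` stands in for the pseudo-random hash functions h and g, and `virtual` is the number of ring points per physical bin:

```python
import bisect
import hashlib

def _h(key: str) -> float:
    """Pseudo-random hash onto the unit ring [0, 1)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHashRing:
    """Each physical bin is placed at `virtual` points on the ring;
    a ball is stored by the first bin clockwise of its hash value."""

    def __init__(self, bins, virtual=100):
        self._points = sorted(
            (_h(f"{b}#{v}"), b) for b in bins for v in range(virtual)
        )

    def lookup(self, ball: str) -> str:
        pos = _h(ball)
        i = bisect.bisect_right(self._points, (pos, ""))
        # Wrap around the ring if the ball hashes past the last point.
        return self._points[i % len(self._points)][1]

ring = ConsistentHashRing(["disk0", "disk1", "disk2", "disk3"])
placement = {f"block-{i}": ring.lookup(f"block-{i}") for i in range(10_000)}
```

With 100 virtual bins per disk, the per-disk loads come out close to the average m/n; dropping `virtual` to 1 reproduces the large arc-length deviation described above.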

Join and Leave Operations I
- In a dynamic network, nodes can join and leave anytime
- The main goal of a DHT is the ability to locate every key in the network at (nearly) any time
- The (planned) removal of bins changes the length of their neighbors' intervals: data has to be moved to a neighbor
- The insertion of bins changes the interval length of their new neighbors

Join and Leave Operations II
- Definition of a view V: a view V is a set of bins of which a particular client is aware
- Monotonicity: a ranged hash function f is monotone if for all views V1 ⊆ V2, f_V2(b) ∈ V1 implies f_V1(b) = f_V2(b)
- Monotonicity implies that in case of a join operation of a bin i, all moved data items have destination i
- Consistent Hashing has the property of monotonicity

Heterogeneous Bins
- Consistent Hashing is (nearly) optimally suited for homogeneous environments, where all bins (disks) have the same capacity and performance
- Heterogeneous bins can be mapped to Consistent Hashing by using a different number of virtual bins for each physical bin
- But the capacity relation between the bins constantly changes, and monotonicity (and some other properties) cannot be maintained

Share Strategy I
- The Share strategy tries to map the heterogeneous problem to a homogeneous solution
- Each bin d is assigned by a hash function g: Bins → [0,1) to a start point g(d) inside the [0,1)-interval
- The length l(c_d) of the bin's interval is proportional to the capacity c_d (performance, or another metric) of bin d
See A. Brinkmann, K. Salzwedel, C. Scheideler: Compact, adaptive placement schemes for non-uniform distribution requirements

Share Strategy II
- How to retrieve the location of a data item x inside this heterogeneous setting?
- Use a hash function h: Items → [0,1) to map x onto the [0,1)-interval
- Use a DHT for homogeneous bins to retrieve the location of x from all intervals cutting h(x)

Share Strategy III
- Properties:
  - (Arbitrarily close to) optimal distribution of balls among bins
  - Computational complexity in O(1)
  - The competitive ratio concerning join and leave is (1+ε) for every ε > 0
- But: Share has been optimized for usage in data center environments; Share is not monotone and only partially suited for P2P networks
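A toy sketch of the idea (not the exact Share algorithm): each bin covers a stretched interval proportional to its capacity, and a uniform tie-breaker stands in for the inner homogeneous DHT. The stretch factor of 3 and the rendezvous-style tie-break are our own simplifications, so the resulting distribution is only roughly, not (1+ε)-, optimal:

```python
import hashlib

def _h(key: str) -> float:
    """Pseudo-random hash onto [0, 1)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class Share:
    """Sketch of the Share idea: bin d covers a sub-interval of [0,1)
    starting at g(d) whose length is proportional to its capacity
    (scaled by a stretch factor); an item x is assigned by a uniform
    strategy among the bins whose interval contains h(x)."""

    def __init__(self, capacities: dict, stretch: float = 3.0):
        total = sum(capacities.values())
        self.intervals = {
            d: (_h(f"start#{d}"), stretch * c / total)
            for d, c in capacities.items()
        }

    def _covers(self, d: str, x: float) -> bool:
        start, length = self.intervals[d]
        # The interval may wrap around the [0,1) ring.
        return (x - start) % 1.0 < length

    def lookup(self, item: str) -> str:
        x = _h(item)
        candidates = [d for d in self.intervals if self._covers(d, x)]
        # The candidates form a homogeneous sub-problem; pick one with a
        # uniform tie-breaker (a stand-in for the inner DHT).
        return max(candidates, key=lambda d: _h(f"{d}#{item}"))

share = Share({"big": 200, "mid": 100, "small": 50})
```

Even this crude version assigns noticeably more items to larger bins, which is the behavior the stretched intervals are designed to produce.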

V:Drive
- V:Drive is an out-of-band virtualization environment:
  - Each (Linux) server includes an additional block-level driver module
  - A metadata appliance (MDA) ensures a consistent view on storage and servers in the SAN
  - The Share strategy is used as data distribution strategy
See A. Brinkmann, S. Effert, et al.: Influence of Adaptive Data Layouts on Performance in dynamically changing Storage Environments

Performance V:Drive - Static
[Figure: throughput (MB/s) and average latency (ms) of a synthetic random I/O benchmark in a static configuration, comparing physical volumes, V:Drive, and LVM]

Performance V:Drive - Dynamic
[Figure: throughput (MB/s) and average latency (ms) of a synthetic random I/O benchmark in a dynamic configuration, comparing physical volumes, V:Drive, and LVM]

V:Drive - Reconfiguration Overhead

Randomization and Redundancy
- Randomized data distribution schemes do not include mechanisms to protect data against disk failures
- Question: how to use randomization and RAID schemes together?
- Assumption: k copies of a data block have to be distributed over n disks, and no two copies of a data block are allowed to be stored on the same disk

Trivial Solutions
- Trivial Solution I: divide the storage system into k storage pools and distribute the first copies over the first pool, ..., the k-th copies over the k-th pool
  - Missing flexibility
- Trivial Solution II: the first copy is distributed over all disks, the second copy over all but the previously chosen disk, ...
  - Not able to use the capacity efficiently

Observation
- Trivial Solution II is not able to use the capacity efficiently, because big storage systems are penalized compared to smaller devices
- Theorem: Assume a trivial replication strategy that has to distribute k copies of m balls over n > k bins. Furthermore, the biggest bin has a capacity c_max that is at least (1 + ε) · c_j for the next biggest bin j. In this case, the expected load of the biggest bin will be smaller than the expected load required for optimal capacity efficiency.
See A. Brinkmann, S. Effert, et al.: Dynamic and Redundant Data Placement

Idea
- The algorithm has to ensure that bigger bins get data items according to their capacities
- This can be ensured by an algorithm that iterates over a list of bins sorted by capacity:
  1. At each iteration, the algorithm randomly decides whether or not to place a copy of the ball on the current bin
  2. If one of the k copies of a ball has been placed, use the optimal strategy for (k-1) copies with the remaining bins as input
- Challenge: how to make the random decision in step 1 of each iteration

Example for Mirroring (k = 2)

  Disk capacity c_i                      100 GB  100 GB  80 GB  80 GB  60 GB
  c_i / C     (C = total capacity)       0.24    0.24    0.19   0.19   0.14
  c_i / C_i   (C_i = c_i + ... + c_n)    0.24    0.31    0.36   0.57   1.00
  k · c_i / C_i                          0.48    0.62    0.72   1.14   2.00

- c_i / C denotes the capacity of disk i relative to all disks
- c_i / C_i denotes the capacity of disk i relative to all disks starting with index i
- k · c_i / C_i is the weight for the random decision!

Example for Mirroring (k = 2), continued
- If, e.g., disk 2 is chosen for the first copy of a mirror, just distribute the second copy according to Share over disks 3, 4, and 5
- Some adaptation is necessary if disk 3 is chosen, because the weight of disk 4 is greater than 1

Observations
- The strategy can easily be extended to arbitrary k
- The data distribution is optimal
- The redistribution of data in a dynamic environment is k²-competitive
- The computational complexity can be reduced to O(k)
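The iteration behind the mirroring example can be sketched as follows. This is a simplified reading of the scheme: weights greater than 1 are simply treated as a forced pick, whereas the paper applies a more careful adaptation in that case, so the sketch is only approximately fair:

```python
import random

def place_copies(capacities: dict, k: int, rng: random.Random) -> list:
    """Sketch of the redundant placement idea: walk over the bins in
    order of decreasing capacity and decide randomly, with weight
    k * c_i / (c_i + ... + c_n), whether bin i receives one of the
    k copies; after a hit, continue with k-1 copies."""
    disks = sorted(capacities, key=capacities.get, reverse=True)
    chosen = []
    remaining = sum(capacities.values())
    for i, d in enumerate(disks):
        if k == 0:
            break
        left = len(disks) - i
        weight = k * capacities[d] / remaining
        # As many copies as disks left (or weight >= 1) forces a pick.
        if k == left or rng.random() < weight:
            chosen.append(d)
            k -= 1
        remaining -= capacities[d]
    return chosen

rng = random.Random(42)
caps = {"d1": 100, "d2": 100, "d3": 80, "d4": 80, "d5": 60}
samples = [place_copies(caps, 2, rng) for _ in range(50_000)]
```

Over many blocks, each mirror pair consists of two distinct disks and the larger disks receive correspondingly more copies, which is exactly what Trivial Solution II fails to achieve.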

Fairness of k-fold Replication

Adaptivity of k-fold Replication

Metadata Management
- The assignment of data items to disks can be solved efficiently by random data distribution schemes:
  - Very good distribution of data and requests
  - Low computational complexity
  - Adaptivity to new infrastructures: optimal without redundancy, ok with redundancy
  - Over-provisioning can be efficiently integrated
- ... but how to find the position of a data item on the disks?
  - This is equal to the dictionary problem
  - It requires O(n) entries to find the locations of n objects!
  - It defines the bulk of the metadata

Dictionary Problem: Extent Size vs. Volume Size

  Volume \ Extent   4 KB    16 KB   256 KB  4 MB    16 MB   256 MB  1 GB
  1 GB              8 MB    2 MB    128 KB  8 KB    2 KB    128 B   32 B
  64 GB             512 MB  128 MB  8 MB    512 KB  128 KB  8 KB    2 KB
  1 TB              8 GB    2 GB    128 MB  8 MB    2 MB    128 KB  32 KB
  64 TB             512 GB  128 GB  8 GB    512 MB  128 MB  8 MB    2 MB
  1 PB              8 TB    2 TB    128 GB  8 GB    2 GB    128 MB  32 MB

- Extent: the smallest continuous unit that can be addressed by the virtualization solution
- For small extent sizes, the dictionary easily becomes too big to be stored inside each server system
- Solutions: caching, huge extent sizes, object-based storage systems
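The table follows from a simple rule of three: the dictionary needs one entry per extent, at roughly 32 bytes per entry (our reading of the numbers above, not a figure stated explicitly in the slides). A few lines reproduce the cells:

```python
def dictionary_size(volume_bytes: int, extent_bytes: int,
                    entry_bytes: int = 32) -> int:
    """Size of the dictionary mapping every extent of a volume to its
    location, assuming `entry_bytes` of metadata per extent (32 bytes
    reproduces the table above)."""
    return (volume_bytes // extent_bytes) * entry_bytes

KB, MB, GB, TB, PB = 2**10, 2**20, 2**30, 2**40, 2**50

# A 1 TB volume with 4 KB extents already needs 8 GB of metadata,
# while 4 MB extents shrink the dictionary to 8 MB.
print(dictionary_size(1 * TB, 4 * KB) // GB)   # -> 8
print(dictionary_size(1 * TB, 4 * MB) // MB)   # -> 8
```

The quadratic-looking growth across the table is really linear in the number of extents, which is why huge extents or object-based storage keep the dictionary manageable.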

Summary and Conclusions (tutorial outline)
- Introduction into Disk Arrays
- Why Heterogeneity?
- Deterministic Data Placement Schemes
- Randomized Data Placement Schemes
- Summary and Conclusions

Summary
- Problem to be solved: scalable storage systems supporting heterogeneous devices
- Two solutions developed concurrently:
  - Deterministic: modify RAID technology while keeping its flavor
  - Non-deterministic: distribute data blocks by using randomization, with RAID encoding on top of the randomization process

Conclusions
- Advantages of each version:
  - Deterministic: easy metadata management, easy recovery
  - Non-deterministic: good support for storage-on-demand concepts, lower probability of getting to a degraded state(?)
- Both approaches are complementary concerning their advantages, but have many similarities: a zone is very similar to a group of extents (not fully described in this tutorial)
- Next step: work on a mixed version

Bibliography I
- A. Brinkmann, S. Effert, F. Meyer auf der Heide, C. Scheideler: Dynamic and Redundant Data Placement. In Proceedings of the 27th IEEE International Conference on Distributed Computing Systems (ICDCS), 2007
- A. Brinkmann, S. Effert, M. Heidebuer, M. Vodisek: Influence of Adaptive Data Layouts on Performance in dynamically changing Storage Environments. In Proceedings of the 14th Euromicro Conference on Parallel, Distributed and Network-based Processing, 2006
- A. Brinkmann, K. Salzwedel, C. Scheideler: Compact, adaptive placement schemes for non-uniform distribution requirements. In Proceedings of the 14th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2002
- T. Cortes, J. Labarta: Taking Advantage of Heterogeneity in Disk Arrays. Journal on Parallel and Distributed Computing (JPDC), Volume 63, Number 4, pp. 448-464, April 2003
- J. L. Gonzalez, T. Cortes: An Adaptive Data Block Placement based on Deterministic Zones (AdaptiveZ). International Conference on Grid computing, high-performance and Distributed Applications (GADA'07), Vilamoura, Algarve, Portugal, Nov 29-30, 2007

Bibliography II
- J. L. Gonzalez, T. Cortes: Evaluating the Effects of Upgrading Heterogeneous Disk Arrays. International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS 2006), Calgary, Canada, July 31 - August 2, 2006
- M. Holland, G. A. Gibson: Parity declustering for continuous operation in redundant disk arrays. In Proceedings of the fifth international conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, 1992
- D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, R. Panigrahy: Consistent Hashing and Random Trees: Tools for Relieving Hot Spots on the World Wide Web. In Proceedings of the Symposium on Theory of Computing (STOC), 1997
- P. Lyman, H. R. Varian: How much information 2003? School of Information Management and Systems, University of California at Berkeley
- D. A. Patterson, G. A. Gibson, R. H. Katz: A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the International Conference on Management of Data (SIGMOD), 1988

Bibliography III
- M. Raab, A. Steger: Balls into Bins - A Simple and Tight Analysis. In Proceedings of the 2nd Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM'98), 1998
- I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In Proceedings of the 2001 ACM SIGCOMM Conference, 2001
- R. Yellin: The data storage evolution. Has disk capacity outgrown its usefulness? Teradata Magazine, 2006