Handling heterogeneous storage devices in clusters


Handling heterogeneous storage devices in clusters. André Brinkmann, University of Paderborn; Toni Cortes, Barcelona Supercomputing Center

Randomized Data Placement Schemes
- Introduction: Randomization, Balls into bins
- Distributed Hash Tables: Consistent Hashing and Share
- Redundancy and Randomized Data Placement Schemes
- Distributed Metadata Management

Introduction: Randomization
- Deterministic data placement schemes suffered from many drawbacks for a long time:
  - Heterogeneity has been an issue
  - It has been costly to adapt to new storage systems
  - It is difficult to support storage-on-demand concepts
- Is there an alternative to deterministic schemes?
- Yes, randomization can help to overcome these drawbacks, but new challenges are introduced!

Balls into bins Games I
- Basic task of balls into bins games: assign a set of m balls to n bins
- Motivation:
  - Bins = hard disks
  - Balls = data items
  - L = max number of data items on each disk
- Where should I place the next item?

Balls into bins Games II
- Basic result: assign n balls to n bins, choosing one bin for every ball independently and uniformly at random
- The maximum load is sharply concentrated at (1 + o(1)) · ln n / ln ln n w.h.p., where w.h.p. abbreviates "with probability at least 1 - n^(-α), for any fixed α > 0"

Balls into bins Games III
- This sounds terrible: the maximum loaded hard disk stores about ln n / ln ln n times more data than the average. This seems not to be scalable, or ...?
- But the model assumes that only very few data items are stored inside the environment, while each disk is able to store many objects. Let's assume that "many objects" means m >> n · (ln n)^3. Then it holds w.h.p. that the maximum load is only m/n + O(sqrt(m · ln n / n)), i.e., the relative imbalance vanishes.
See, e.g., M. Raab, A. Steger: Balls into Bins - A Simple and Tight Analysis
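The balls-into-bins bounds above are easy to check empirically. A minimal simulation (plain Python, all names our own) contrasts the sparse case m = n, where the fullest bin clearly overshoots the average, with the dense case m >> n, where the relative imbalance almost disappears:

```python
import random

def max_load(m: int, n: int, seed: int = 0) -> int:
    """Throw m balls into n bins independently and uniformly at random;
    return the load of the fullest bin."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(m):
        loads[rng.randrange(n)] += 1
    return max(loads)

n = 10_000

# Sparse case (m = n): the fullest bin holds roughly ln n / ln ln n
# balls, far above the average load of 1.
sparse = max_load(n, n)

# Dense case (m >> n): the deviation from the average m/n grows only
# like sqrt(m ln n / n), so the placement is almost perfectly balanced.
m = 100 * n
dense = max_load(m, n)

print(f"m = n:     max load {sparse}, average 1")
print(f"m = 100n:  max load {dense}, average {m // n}")
```

Running this shows the qualitative gap the slides describe: the sparse maximum is several times the average, while the dense maximum is within a few tens of percent of it.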

Distributed Hash Tables
- Randomization introduces some (well known) challenges
- Key questions are:
  - How can we retrieve a stored data item?
  - How can we adapt to a changing number of disks?
  - How can we handle heterogeneity?
  - How can we support redundancy?
- These are the key tasks of Distributed Hash Tables (DHTs)

Consistent Hashing I
- Introduced in the context of Web Caching
- Bins are mapped by a pseudo-random hash function h: Bins → [0,1) onto a ring of length 1
- Bins become responsible for their interval
- Balls are mapped by an additional hash function g: Balls → [0,1) onto the ring
- Each bin stores the balls in its interval
See D. Karger, E. Lehman et al.: Consistent Hashing and Random Trees: Tools for Relieving Hot Spots on the World Wide Web

Consistent Hashing II
- The average load of each bin is m/n, but the deviation from the average can be high: the maximum arc length on the ring becomes Θ(log n / n) w.h.p.
- Solution: each bin is mapped by a set of independent hash functions to multiple points on the ring. The maximum arc length assigned to a bin can be reduced to (1 + ε)/n for an arbitrarily small constant ε > 0 if Θ(log n) virtual bins are used for each physical bin.
See I. Stoica, R. Morris, et al.: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications
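The two-hash-function scheme with virtual bins can be sketched in a few lines. This is a toy model rather than the Chord implementation: a single SHA-256-based function `_h` stands in for the pseudo-random hash functions h and g, and `virtual` is the number of ring points per physical bin:

```python
import bisect
import hashlib

def _h(key: str) -> float:
    """Pseudo-random hash onto the unit ring [0, 1)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHashRing:
    """Each physical bin is placed at `virtual` points on the ring;
    a ball is stored by the first bin clockwise of its hash value."""

    def __init__(self, bins, virtual=100):
        self._points = sorted(
            (_h(f"{b}#{v}"), b) for b in bins for v in range(virtual)
        )

    def lookup(self, ball: str) -> str:
        pos = _h(ball)
        i = bisect.bisect_right(self._points, (pos, ""))
        # Wrap around the ring if the ball hashes past the last point.
        return self._points[i % len(self._points)][1]

ring = ConsistentHashRing(["disk0", "disk1", "disk2", "disk3"])
placement = {f"block-{i}": ring.lookup(f"block-{i}") for i in range(10_000)}
```

With 100 virtual bins per disk, the per-disk loads come out close to the average m/n; dropping `virtual` to 1 reproduces the large arc-length deviation described above.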

Join and Leave Operations I
- In a dynamic network, nodes can join and leave anytime
- The main goal of a DHT is the ability to locate every key in the network at (nearly) any time
- The (planned) removal of bins changes the length of their neighbors' intervals: data has to be moved to a neighbor
- The insertion of bins changes the interval length of their new neighbors

Join and Leave Operations II
- Definition of a view V: a view V is a set of bins of which a particular client is aware
- Monotonicity: a ranged hash function f is monotone if for all views V1 ⊆ V2, f_V2(b) ∈ V1 implies f_V1(b) = f_V2(b)
- Monotonicity implies that in case of a join operation of a bin i, all moved data items have destination i
- Consistent Hashing has the property of monotonicity

Heterogeneous Bins
- Consistent Hashing is (nearly) optimally suited for homogeneous environments, where all bins (disks) have the same capacity and performance
- Heterogeneous bins can be mapped to Consistent Hashing by using a different number of virtual bins for each physical bin
- But the capacity relation between the bins constantly changes, and monotonicity (and some other properties) cannot be maintained

Share Strategy I
- The Share strategy tries to map the heterogeneous problem to a homogeneous solution
- Each bin d is assigned by a hash function g: Bins → [0,1) to a start point g(d) inside the [0,1)-interval
- The length l(c_d) of the bin's interval is proportional to the capacity c_d (performance, or another metric) of bin d
See A. Brinkmann, K. Salzwedel, C. Scheideler: Compact, adaptive placement schemes for non-uniform distribution requirements

Share Strategy II
- How to retrieve the location of a data item x inside this heterogeneous setting?
- Use a hash function h: Items → [0,1) to map x onto the [0,1)-interval
- Use a DHT for homogeneous bins to retrieve the location of x from all intervals cutting h(x)

Share Strategy III
- Properties:
  - (Arbitrarily close to) optimal distribution of balls among bins
  - Computational complexity in O(1)
  - The competitive ratio concerning join and leave is (1+ε) for every ε > 0
- But: Share has been optimized for usage in data center environments; Share is not monotone and only partially suited for P2P networks
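A toy sketch of the idea (not the exact Share algorithm): each bin covers a stretched interval proportional to its capacity, and a uniform tie-breaker stands in for the inner homogeneous DHT. The stretch factor of 3 and the rendezvous-style tie-break are our own simplifications, so the resulting distribution is only roughly, not (1+ε)-, optimal:

```python
import hashlib

def _h(key: str) -> float:
    """Pseudo-random hash onto [0, 1)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class Share:
    """Sketch of the Share idea: bin d covers a sub-interval of [0,1)
    starting at g(d) whose length is proportional to its capacity
    (scaled by a stretch factor); an item x is assigned by a uniform
    strategy among the bins whose interval contains h(x)."""

    def __init__(self, capacities: dict, stretch: float = 3.0):
        total = sum(capacities.values())
        self.intervals = {
            d: (_h(f"start#{d}"), stretch * c / total)
            for d, c in capacities.items()
        }

    def _covers(self, d: str, x: float) -> bool:
        start, length = self.intervals[d]
        # The interval may wrap around the [0,1) ring.
        return (x - start) % 1.0 < length

    def lookup(self, item: str) -> str:
        x = _h(item)
        candidates = [d for d in self.intervals if self._covers(d, x)]
        # The candidates form a homogeneous sub-problem; pick one with a
        # uniform tie-breaker (a stand-in for the inner DHT).
        return max(candidates, key=lambda d: _h(f"{d}#{item}"))

share = Share({"big": 200, "mid": 100, "small": 50})
```

Even this crude version assigns noticeably more items to larger bins, which is the behavior the stretched intervals are designed to produce.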

V:Drive
- V:Drive is an out-of-band virtualization environment:
  - Each (Linux) server includes an additional block-level driver module
  - A metadata appliance (MDA) ensures a consistent view on storage and servers in the SAN
  - The Share strategy is used as data distribution strategy
See A. Brinkmann, S. Effert, et al.: Influence of Adaptive Data Layouts on Performance in dynamically changing Storage Environments

Performance V:Drive - Static
[Figure: throughput (MB/s) and average latency (ms) of a synthetic random I/O benchmark in a static configuration, comparing physical volumes, V:Drive, and LVM]

Performance V:Drive - Dynamic
[Figure: throughput (MB/s) and average latency (ms) of a synthetic random I/O benchmark in a dynamic configuration, comparing physical volumes, V:Drive, and LVM]

V:Drive - Reconfiguration Overhead

Randomization and Redundancy
- Randomized data distribution schemes do not include mechanisms to protect data against disk failures
- Question: how to use randomization and RAID schemes together?
- Assumption: k copies of a data block have to be distributed over n disks, and no two copies of a data block are allowed to be stored on the same disk

Trivial Solutions
- Trivial Solution I: divide the storage system into k storage pools and distribute the first copies over the first pool, ..., the k-th copies over the k-th pool
  - Missing flexibility
- Trivial Solution II: the first copy is distributed over all disks, the second copy over all but the previously chosen disk, ...
  - Not able to use the capacity efficiently

Observation
- Trivial Solution II is not able to use the capacity efficiently, because big storage systems are penalized compared to smaller devices
- Theorem: Assume a trivial replication strategy that has to distribute k copies of m balls over n > k bins. Furthermore, the biggest bin has a capacity c_max that is at least (1 + ε) · c_j for the next biggest bin j. In this case, the expected load of the biggest bin will be smaller than the expected load required for optimal capacity efficiency.
See A. Brinkmann, S. Effert, et al.: Dynamic and Redundant Data Placement

Idea
- The algorithm has to ensure that bigger bins get data items according to their capacities
- This can be ensured by an algorithm that iterates over a list of bins sorted by capacity:
  1. At each iteration, the algorithm randomly decides whether or not to place a copy of the ball on the current bin
  2. If one of the k copies of a ball has been placed, use the optimal strategy for (k-1) copies with the remaining bins as input
- Challenge: how to make the random decision in step 1 of each iteration

Example for Mirroring (k = 2)

  Disk capacity c_i                      100 GB  100 GB  80 GB  80 GB  60 GB
  c_i / C     (C = total capacity)       0.24    0.24    0.19   0.19   0.14
  c_i / C_i   (C_i = c_i + ... + c_n)    0.24    0.31    0.36   0.57   1.00
  k · c_i / C_i                          0.48    0.62    0.72   1.14   2.00

- c_i / C denotes the capacity of disk i relative to all disks
- c_i / C_i denotes the capacity of disk i relative to all disks starting with index i
- k · c_i / C_i is the weight for the random decision!

Example for Mirroring (k = 2), continued
- If, e.g., disk 2 is chosen for the first copy of a mirror, just distribute the second copy according to Share over disks 3, 4, and 5
- Some adaptation is necessary if disk 3 is chosen, because the weight of disk 4 is greater than 1

Observations
- The strategy can easily be extended to arbitrary k
- The data distribution is optimal
- The redistribution of data in a dynamic environment is k²-competitive
- The computational complexity can be reduced to O(k)
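The iteration behind the mirroring example can be sketched as follows. This is a simplified reading of the scheme: weights greater than 1 are simply treated as a forced pick, whereas the paper applies a more careful adaptation in that case, so the sketch is only approximately fair:

```python
import random

def place_copies(capacities: dict, k: int, rng: random.Random) -> list:
    """Sketch of the redundant placement idea: walk over the bins in
    order of decreasing capacity and decide randomly, with weight
    k * c_i / (c_i + ... + c_n), whether bin i receives one of the
    k copies; after a hit, continue with k-1 copies."""
    disks = sorted(capacities, key=capacities.get, reverse=True)
    chosen = []
    remaining = sum(capacities.values())
    for i, d in enumerate(disks):
        if k == 0:
            break
        left = len(disks) - i
        weight = k * capacities[d] / remaining
        # As many copies as disks left (or weight >= 1) forces a pick.
        if k == left or rng.random() < weight:
            chosen.append(d)
            k -= 1
        remaining -= capacities[d]
    return chosen

rng = random.Random(42)
caps = {"d1": 100, "d2": 100, "d3": 80, "d4": 80, "d5": 60}
samples = [place_copies(caps, 2, rng) for _ in range(50_000)]
```

Over many blocks, each mirror pair consists of two distinct disks and the larger disks receive correspondingly more copies, which is exactly what Trivial Solution II fails to achieve.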

Fairness of k-fold Replication

Adaptivity of k-fold Replication

Metadata Management
- The assignment of data items to disks can be solved efficiently by random data distribution schemes:
  - Very good distribution of data and requests
  - Low computational complexity
  - Adaptivity to new infrastructures: optimal without redundancy, ok with redundancy
  - Over-provisioning can be efficiently integrated
- ... but how to find the position of a data item on the disks?
  - This is equal to the dictionary problem
  - It requires O(n) entries to find the locations of n objects!
  - It defines the bulk of the metadata

Dictionary Problem: Extent Size vs. Volume Size

  Volume \ Extent   4 KB    16 KB   256 KB  4 MB    16 MB   256 MB  1 GB
  1 GB              8 MB    2 MB    128 KB  8 KB    2 KB    128 B   32 B
  64 GB             512 MB  128 MB  8 MB    512 KB  128 KB  8 KB    2 KB
  1 TB              8 GB    2 GB    128 MB  8 MB    2 MB    128 KB  32 KB
  64 TB             512 GB  128 GB  8 GB    512 MB  128 MB  8 MB    2 MB
  1 PB              8 TB    2 TB    128 GB  8 GB    2 GB    128 MB  32 MB

- Extent: the smallest continuous unit that can be addressed by the virtualization solution
- For small extent sizes, the dictionary easily becomes too big to be stored inside each server system
- Solutions: caching, huge extent sizes, object-based storage systems
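The table follows from a simple rule of three: the dictionary needs one entry per extent, at roughly 32 bytes per entry (our reading of the numbers above, not a figure stated explicitly in the slides). A few lines reproduce the cells:

```python
def dictionary_size(volume_bytes: int, extent_bytes: int,
                    entry_bytes: int = 32) -> int:
    """Size of the dictionary mapping every extent of a volume to its
    location, assuming `entry_bytes` of metadata per extent (32 bytes
    reproduces the table above)."""
    return (volume_bytes // extent_bytes) * entry_bytes

KB, MB, GB, TB, PB = 2**10, 2**20, 2**30, 2**40, 2**50

# A 1 TB volume with 4 KB extents already needs 8 GB of metadata,
# while 4 MB extents shrink the dictionary to 8 MB.
print(dictionary_size(1 * TB, 4 * KB) // GB)   # -> 8
print(dictionary_size(1 * TB, 4 * MB) // MB)   # -> 8
```

The quadratic-looking growth across the table is really linear in the number of extents, which is why huge extents or object-based storage keep the dictionary manageable.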

Summary and Conclusions (tutorial outline)
- Introduction into Disk Arrays
- Why Heterogeneity?
- Deterministic Data Placement Schemes
- Randomized Data Placement Schemes
- Summary and Conclusions

Summary
- Problem to be solved: scalable storage systems supporting heterogeneous devices
- Two solutions developed concurrently:
  - Deterministic: modify RAID technology while keeping its flavor
  - Non-deterministic: distribute data blocks by using randomization, with RAID encoding on top of the randomization process

Conclusions
- Advantages of each version:
  - Deterministic: easy metadata management, easy recovery
  - Non-deterministic: good support for storage-on-demand concepts, lower probability of getting to a degraded state(?)
- Both approaches are complementary concerning their advantages, but have many similarities: a zone is very similar to a group of extents (not fully described in this tutorial)
- Next step: work on a mixed version

Bibliography I
- A. Brinkmann, S. Effert, F. Meyer auf der Heide, C. Scheideler: Dynamic and Redundant Data Placement. In Proceedings of the 27th IEEE International Conference on Distributed Computing Systems (ICDCS), 2007
- A. Brinkmann, S. Effert, M. Heidebuer, M. Vodisek: Influence of Adaptive Data Layouts on Performance in dynamically changing Storage Environments. In Proceedings of the 14th Euromicro Conference on Parallel, Distributed and Network-based Processing, 2006
- A. Brinkmann, K. Salzwedel, C. Scheideler: Compact, adaptive placement schemes for non-uniform distribution requirements. In Proceedings of the 14th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2002
- T. Cortes, J. Labarta: Taking Advantage of Heterogeneity in Disk Arrays. Journal on Parallel and Distributed Computing (JPDC), Volume 63, Number 4, pp. 448-464, April 2003
- J. L. Gonzalez, T. Cortes: An Adaptive Data Block Placement based on Deterministic Zones (AdaptiveZ). International Conference on Grid computing, high-performance and Distributed Applications (GADA'07), Vilamoura, Algarve, Portugal, Nov 29-30, 2007

Bibliography II
- J. L. Gonzalez, T. Cortes: Evaluating the Effects of Upgrading Heterogeneous Disk Arrays. International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS 2006), Calgary, Canada, July 31 - August 2, 2006
- M. Holland, G. A. Gibson: Parity declustering for continuous operation in redundant disk arrays. In Proceedings of the fifth international conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, 1992
- D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, R. Panigrahy: Consistent Hashing and Random Trees: Tools for Relieving Hot Spots on the World Wide Web. In Proceedings of the Symposium on Theory of Computing (STOC), 1997
- P. Lyman, H. R. Varian: How much information 2003? School of Information Management and Systems, University of California at Berkeley
- D. A. Patterson, G. A. Gibson, R. H. Katz: A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the International Conference on Management of Data (SIGMOD), 1988

Bibliography III
- M. Raab, A. Steger: Balls into Bins - A Simple and Tight Analysis. In Proceedings of the 2nd Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM'98), 1998
- I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In Proceedings of the 2001 ACM SIGCOMM Conference, 2001
- R. Yellin: The data storage evolution. Has disk capacity outgrown its usefulness? Teradata Magazine, 2006