HYDRAstor: a Scalable Secondary Storage 7th USENIX Conference on File and Storage Technologies (FAST '09) February 26 th 2009 C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, M. Welnicki C. Ungureanu
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 2 Scalable secondary storage Characteristics Huge amount of data Small backup windows Duplication between backup streams Reliable, on-line retrieval Varying value of data Requirements - Scalability (dynamic) - Low cost per TB - Very high write performance - Global deduplication - Failure tolerance - High restore performance - Adjust resilience overhead - Data deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 3 Scalable secondary storage Characteristics Huge amount of data Small backup windows Duplication between backup streams Reliable, on-line retrieval Varying value of data Requirements - Scalability (dynamic) - Low cost per TB - Very high write performance - Global deduplication - Failure tolerance - High restore performance - Adjust resilience overhead - Data deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 4 Scalable secondary storage Characteristics Huge amount of data Small backup windows Duplication between backup streams Reliable, on-line retrieval Varying value of data Requirements - Scalability (dynamic) - Low cost per TB - Very high write performance - Global deduplication - Failure tolerance - High restore performance - Adjust resilience overhead - Data deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 5 Scalable secondary storage Characteristics Huge amount of data Small backup windows Duplication between backup streams Reliable, on-line retrieval Varying value of data Requirements - Scalability (dynamic) - Low cost per TB - Very high write performance - Global deduplication - Failure tolerance - High restore performance - Adjust resilience overhead - Data deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 6 Scalable secondary storage Characteristics Huge amount of data Small backup windows Duplication between backup streams Reliable, on-line retrieval Varying value of data Requirements - Scalability (dynamic) - Low cost per TB - Very high write performance - Global deduplication - Failure tolerance - High restore performance - Adjust resilience overhead - Data deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 7 Challenges High-performance, decentralized global deduplication... in a dynamic, distributed system... with deletion and failures Combination introduces complexity Tension between: Deduplication and dynamic scalability Deduplication and on-demand deletion Failure tolerance and deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 8 Satisfies Scalable secondary storage requirements Started as a research project at NEC Laboratories America, in Princeton, NJ Successfully commercialized Today: real-world, commercial system Sold by NEC in the US and Japan Development of back-end continues at 9LivesData, LLC in Warsaw, Poland Spinoff from NEC Laboratories
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 9 HYDRAstor functionality Content addressable storage (CAS) Vast data repository Storing and extracting streams of blocks Single system image built of independent nodes Support for standard access methods Filesystem, VTL Dynamic capacity sharing Self-recovery from failures On-demand deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 10 Programming Model Repository of blocks Content-addressed Immutable Variable-sized hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 11 Programming Model Repository of blocks Content-addressed Immutable Variable-sized Exposed pointers to other blocks E 011..0 hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 12 Programming Model Repository of blocks Content-addressed hash=010..1 Root1 E Immutable Variable-sized Exposed pointers to other blocks E Trees of blocks E 011..0 hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 13 Programming Model Repository of blocks Content-addressed hash=010..1 Root1 E Root2 E Immutable hash=110..0 Variable-sized Exposed pointers to other blocks Trees of blocks DAGs due to deduplication No cycles possible E E 011..0 011..0 hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 14 Programming Model Repository of blocks Content-addressed hash=010..1 Root1 E Root2 E Immutable hash=110..0 Variable-sized Exposed pointers to other blocks Trees of blocks DAGs due to deduplication No cycles possible E E 011..0 011..0 Deletion of whole trees hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 15 Programming Model Repository of blocks Content-addressed hash=010..1 Root1 E Root2 E Immutable hash=110..0 Variable-sized Exposed pointers to other blocks Trees of blocks DAGs due to deduplication No cycles possible E E 011..0 011..0 Deletion of whole trees hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 16 Programming Model Repository of blocks Content-addressed hash=010..1 Root1 E Root2 E Immutable hash=110..0 Variable-sized Exposed pointers to other blocks Trees of blocks DAGs due to deduplication No cycles possible E E 011..0 011..0 Deletion of whole trees hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 17 Programming Model Repository of blocks Content-addressed Root2 E Immutable hash=110..0 Variable-sized Exposed pointers to other blocks Trees of blocks DAGs due to deduplication No cycles possible E 011..0 011..0 Deletion of whole trees hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 18 Architecture overview Standard server-grade hardware running Linux Scalability on data-center level Access Nodes NFS / CIFS Front-end Internal Network Back-end (CAS Layer) Storage Nodes
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 19 Requirements on scalable storage Failure tolerance Data organization: selected requirements Required internal data services Identify data resilience reduction Fast data rebuilding High performance Preserve locality of data streams Prefetching Dynamic scalability Decentralized data management Load balancing Fast data transfer to new location Deduplication Location of potential duplicates Availability & resiliency verification On-demand deletion Failure-tolerant, distributed deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 20 Requirements on scalable storage Failure tolerance Data organization: selected requirements Required internal data services Identify data resilience reduction Fast data rebuilding High performance Preserve locality of data streams Prefetching Dynamic scalability Decentralized data management Load balancing Fast data transfer to new location Deduplication Location of potential duplicates Availability & resiliency verification On-demand deletion Failure-tolerant, distributed deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 21 Requirements on scalable storage Failure tolerance Data organization: selected requirements Required internal data services Identify data resilience reduction Fast data rebuilding High performance Preserve locality of data streams Prefetching Dynamic scalability Decentralized data management Load balancing Fast data transfer to new location Deduplication Location of potential duplicates Availability & resiliency verification On-demand deletion Failure-tolerant, distributed deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 22 Requirements on scalable storage Failure tolerance Data organization: selected requirements Required internal data services Identify data resilience reduction Fast data rebuilding High performance Preserve locality of data streams Prefetching Dynamic scalability Decentralized data management Load balancing Fast data transfer to new location Deduplication Location of potential duplicates Availability & resiliency verification On-demand deletion Failure-tolerant, distributed deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 23 Requirements on scalable storage Failure tolerance Data organization: selected requirements Required internal data services Identify data resilience reduction Fast data rebuilding High performance Preserve locality of data streams Prefetching Dynamic scalability Decentralized data management Load balancing Fast data transfer to new location Deduplication Location of potential duplicates Availability & resiliency verification On-demand deletion Failure-tolerant, distributed deletion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 24 Failure tolerance: erasure coding Block erasure-coded into N fragments Storage overhead tunable Example: N=8, m=5 Original block Encode Redundant Fragments Original Fragments Decode Any 3 fragments can be lost
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 25 Scalability with DHT: data placement Block location: DHT with prefix routing empty prefix 0 1 00 01 10 11
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 26 Scalability with DHT: data placement Block location: DHT with prefix routing Block mapped to hash prefix hash=011..0 empty prefix Block 0 1 00 01 10 11
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 27 Scalability with DHT: data placement Block location: DHT with prefix routing Block mapped to hash prefix Prefix components Hosted on SNs empty prefix 0 1 hash=011..0 Block N=4 N components per prefix Node 1 00 1 01 10 11 0 2 Node 12 3 1 1 Node 13 2 3 Node 14 0 0 2 Node 15 2 1 Node 16 3 3 0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 28 Scalability with DHT: data placement Block location: DHT with prefix routing Block mapped to hash prefix Prefix components Hosted on SNs empty prefix 0 1 hash=011..0 Block N=4 N components per prefix Node 1 00 1 01 10 11 0 2 Store fragments Node 12 3 1 1 Node 13 2 3 Node 14 0 0 2 Node 15 2 1 Node 16 3 3 0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 29 Scalability with DHT: data placement Block location: DHT with prefix routing Block mapped to hash prefix Prefix components empty prefix hash=011..0 Block Hosted on SNs 0 1 N=4 N components per prefix Node 1 00 1 01 10 11 0 2 Store fragments Node 12 3 1 1 Distributed consensus Node 13 Node 14 Node 15 2 0 0 2 2 3 1 Node 16 3 3 0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 30 Scalability with DHT: data placement Block location: DHT with prefix routing Block mapped to hash prefix Prefix components empty prefix hash=011..0 Block Hosted on SNs 0 1 N=4 N components per prefix Node 1 00 1 01 10 11 0 2 Store fragments Node 12 3 1 1 Distributed consensus Node 13 Node 14 Node 15 2 0 0 2 2 3 1 Node 16 3 3 0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 31 Scalability with DHT: data placement Block location: DHT with prefix routing Block mapped to hash prefix Prefix components empty prefix hash=011..0 Block Hosted on SNs 0 1 N=4 N components per prefix Node 1 00 01 10 11 Store fragments Node 12 3 1 1 Distributed consensus Node 13 Node 14 Node 15 2 0 0 1 2 0 2 3 2 1 Node 16 3 3 0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 32 Scalability with DHT: data placement Block location: DHT with prefix routing Block mapped to hash prefix Prefix components empty prefix hash=011..0 Block Hosted on SNs 0 1 N=4 N components per prefix Node 1 00 01 10 11 Store fragments Node 12 3 1 1 Distributed consensus Node 13 Node 14 Node 15 2 0 0 1 2 0 2 3 2 1 Node 16 3 3 0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 33 Scalability with DHT: data placement Block location: DHT with prefix routing Block mapped to hash prefix Prefix components empty prefix hash=011..0 Block Hosted on SNs 0 1 N=4 N components per prefix Node 1 00 01 10 11 Store fragments Node 12 3 1 1 Distributed consensus Load balancing Node 13 Node 14 Node 15 Node 16 2 0 0 1 2 3 0 2 3 3 2 1 0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 34 Data organization: synchrun chains A B C D E F G Data stream split to blocks
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 35 Data organization: synchrun chains A B C D E F G 010 101 110 011 000 011 100 Data stream split to blocks es of blocks computed
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 36 Data organization: synchrun chains A B C D E F G 010 101 110 011 000 011 100 Data stream split to blocks es of blocks computed Routing through DHT
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 37 Data organization: synchrun chains A B C D E F G 010 101 Prefix 01 110 011 000 011 100 Data stream split to blocks es of blocks computed Routing through DHT
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 38 Data organization: synchrun chains A B C D E F G 010 101 Prefix 01 110 011 000 Compression 011 100 Data stream split to blocks es of blocks computed Routing through DHT Erasure Coding
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 39 Data organization: synchrun chains 0 1 2 3 A B C D E F G 010 Prefix 01 101 110 011 000 Compression Erasure Coding 011 100 Data stream split to blocks es of blocks computed Routing through DHT Erasure-coded fragments stored by components
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 40 Data organization: synchrun chains A B C D E F G 010 Prefix 01 101 110 011 000 Compression 011 100 Data stream split to blocks es of blocks computed Routing through DHT 0 Erasure Coding A D F Erasure-coded fragments stored by components 1 A D F 2 3 A D F A D F
Synchrun HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 41 Data organization: synchrun chains A B C D E F G 010 Prefix 01 101 110 011 000 Compression 011 100 Data stream split to blocks es of blocks computed Routing through DHT 0 Erasure Coding Synchrun 1 Synchrun 2 Synchrun 3 A D F Erasure-coded fragments stored by components Grouped into synchruns 1 A D F 2 A D F 3 A D F
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 42 Data organization: synchrun chains A B C D E F G 010 Prefix 01 101 110 011 000 Compression 011 100 Data stream split to blocks es of blocks computed Routing through DHT 0 Erasure Coding Synchrun 1 Synchrun 2 Synchrun 3 A D F Erasure-coded fragments stored by components Grouped into synchruns 1 2 A D F A D F Containers stored on disks Fragment metadata separately from data 3 A D F Container Synchrun
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 43 Data organization: synchrun chains A B C D E F G 010 Prefix 01 101 110 011 000 Compression 011 100 Data stream split to blocks es of blocks computed Routing through DHT 0 Erasure Coding Synchrun 1 Synchrun 2 Synchrun 3 A D F Erasure-coded fragments stored by components Grouped into synchruns 1 2 3 A D F A D F A D F Containers stored on disks Fragment metadata separately from data Ordered synchrun chains Preserve order & locality Container Synchrun Manageable
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 44 Synchrun chains in a dynamic system 01:1
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 45 System growth: split 01:1 010:1 011:1
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 46 System growth: split 01:1 010:1 011:1
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 47 System growth: split 01:1 010:1 011:1
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 48 Concatenation 01:1 010:1
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 49 Concatenation 01:1 010:1 010:1
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 50 Marking blocks to reclaim 01:1 010:1 010:1
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 51 Space reclamation & Concatenation 01:1 010:1 010:1 010:1
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 52 Data Services: Identification of data resiliency level Missing fragments 01:0 01:1 01:2 01:3
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 53 Data Services: Identification of data resiliency level 01:0 01:1 01:2 01:3 Chain scanning
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 54 Data Services: Identification of data resiliency level 01:0 01:1 01:2 01:3 Chain scanning
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 55 Data Services: Identification of data resiliency level 01:0 01:1 01:2 01:3 Chain scanning
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 56 Data Services: Identification of data resiliency level 01:0 01:1 01:2 01:3 Chain scanning
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 57 Data services: reconstruction 01:0 01:1 01:2 01:3 Sequential read/write of entire Containers Erasure decoding and re-encoding
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 58 Data services: reconstruction 01:0 01:1 01:2 01:3 Sequential read/write of entire Containers Erasure decoding and re-encoding
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 59 Data services: reconstruction 01:0 01:1 01:2 01:3 Sequential read/write of entire Containers Erasure decoding and re-encoding
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 60 Data services: fast data transfer 01:0 01:1 01:2 01:3 Old component 01:3 Location of new node (DHT)
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 61 Data services: fast data transfer 01:0 01:1 01:2 01:3 Data transfer Old component 01:3
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 62 Data services: fast data transfer 01:0 01:1 01:2 01:3 Data transfer Old component 01:3
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 63 Data services: fast data transfer 01:0 01:1 01:2 01:3 Data transfer Old component 01:3
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 64 Data services: fast data transfer 01:0 01:1 01:2 01:3 Old component 01:3
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 65 Data services for deduplication Global: duplicates detected in entire system DHT routing based on content Inline deduplication: has to be high-performance Prefetching Containers for streams of duplicates Block hashes stored separately
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 66 Data services for deduplication hash=011.. Block 01:0 Choose complete chain 01:1 01:2 01:3 Completeness: definitely not a duplicate Deletion interaction: wasn't the block scheduled for deletion?
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 67 Data services for deduplication hash=011.. Block 01:0 01:1 Query 01:2 01:3
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 68 Data services for deduplication hash=011.. Block 01:0 01:1 01:2 01:3 Local candidate found
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 69 Data services for deduplication hash=011.. Block 01:0 01:1 01:2 Successful dedup 01:3 Candidate verification
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 70 On-demand data deletion Distributed garbage collection Per-block reference counter stored perfragment Failure-tolerant Block reference counter calculated independently on peer Container chains Interference with duplicate elimination: read-only phase for block tree traversal space reclamation in background
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 71 Writes during node failure Writing Reconstruction
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 72 Write Scaling nodes added while writing
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 73 Write Scaling nodes added while writing
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 74 Questions?