Scale-out Data Deduplication Architecture

Scale-out Data Deduplication Architecture Gideon Senderov Product Management & Technical Marketing NEC Corporation of America

Outline Data Growth and Retention Deduplication Methods Legacy Architecture Limitations Adaptive Platform for Long-term Data Enhanced Scale-out Attributes Scalability Performance Resiliency Page 2

Long-term Data According to SNIA, 80% of files are dormant, unchanged. According to Contoural, average file shares grow 40% annually. Non-mission critical data filling costly Tier 1 storage (EB) 90 80 70 60 50 40 30 20 10 0 Consumption of Enterprise Disk Capacity by Type 88% 2008 2009 2010 2011 2012 2013 2014 Content Depots & Cloud Unstructured data Replicated data Structured data CAGR 75.6% 54.8% 24.2% 23.6% Page 3 Source: IDC 2010

Data Outlasts Its Container MIGRATE DATA MIGRATE DATA DATA MINING MIGRATE DATA WWW DATA MINING FILE SHARING FILE SHARING Months Years Decades Centuries Increasing Enterprise Size Page 4

Data Migration Costs In 2011 there will be 1775 Exabytes of information Archive data outlasts a storage container, data migrate costs $1K $1.8K per TB 15% of Storage Admin s time spent on data migration Source: IDC, Gartner, Contoural Consulting Page 5

Deduplication Granularity File (SIS) Data deduplication performed at a file or object granularity Sub-file (Block) Data deduplication performed at a sub-file granularity Fixed size chunk Variable size chunk Sub-file variable size chunk deduplication benefits Greater efficiency by deduplicating data at finer granularity Automatic adjustments ( slide ) across inserted data Page 6

Inline vs. Post-process Inline Data deduplication performed before writing the deduplicated data Post-Process Data deduplication performed after the data to be deduplicated has been initially stored Inline deduplication benefits Eliminate need for disk buffer for un-deduplicated data Eliminate need for subsequent idle time for processing Immediate replication and shorter RTO for DR Page 7

Generic vs. Application-aware Generic Deduplication with same algorithm against all data streams Application-aware Custom deduplication for specific data streams to account for metadata inserted by the corresponding applications Application-aware deduplication benefits Greater efficiency by eliminating negative impact of application metadata on deduplication ratio Cross-application deduplication for greater scope and higher deduplication efficiency Page 8

Local vs. Global Local Deduplication across a single node or sub-node, requiring separate deduplication repositories for multiple nodes Global Deduplication spanning multiple nodes with a single shared deduplication repository across all nodes Global deduplication benefits Greater scalability of a single deduplication repository Greater efficiency with a broader scope spanning multiple nodes Page 9

Legacy Scalability Limitations Inadequate scalability of capacity & performance Cannot scale performance to keep up with growth Multiple products with different architectures More siloed capacity to manage Limited deduplication scope Limited scalability proliferates duplicate data across appliances Lower deduplication ratio for large environments Page 10

Local vs. Global Scope Single deduplication repository across entire solution Data deduplication across ALL data from ALL nodes Cross-node dedupe for greater efficiency Cross-application dedupe with application-awareness Local Volume Local Node Global Dedupe Repository Dedupe Repository Dedupe Repository Dedupe Repository Dedupe Repository Dedupe Repository Page 11

Legacy Resiliency Limitations Day 1: Full Day 2: Incremental Day 3: Incremental Day 7: Full 1 7 1 4 4 6 2 1 1 6 6 3 5 1 1 6 4 7 1 6 1 4 7 8 Q: What data can be restored if block #1 lost? A: NONE! 1 2 3 4 5 6 7 8 Traditional RAID is not sufficient for deduped data Page 12

Adaptive Scale-out Architecture One system becomes Greener, Faster, Denser Non-disruptive Remove Non-disruptive Add, Upgrade V1 Grid + V1 + V2 + Vx = 1 System Concurrently support multiple HW generations Non-disruptive resource scaling Zero provisioning, dynamic self-management Automatically balance system workload In-place technology refresh No migration Page 13

Performance Independent Linear Scalability Uniquely Scale Performance and Capacity to Meet Current and Future Needs Capacity Customized performance and capacity per application Dynamically adjust to changing data growth Page 14

Enhanced Scale-out Attributes Criteria Scope Performance Protection Attributes GLOBAL DEDUPE SCALABLE DEDUPE SAFE DEDUPE Page 15

Scope Multiple applications and data sources across a single scalable deduplication repository Broader scope leads to greater efficiencies Higher dedupe ratios Improved capacity utilization Easier management Longer retention periods Lower costs Single scalable system vs. multiple disparate silos Cross-application deduplication for greater efficiency Global deduplication for entire environment Page 16

Performance Linear performance scalability to keep up with data growth Scalability Linear scalability vs. high degradation Independent performance scalability Beyond the physical boundaries of the system Inline dedupe vs. post-process Keeping up with the workload Effects of deduplication rates Higher performance with higher dedupe Page 17

Scale-out Dedupe Architecture Multi-node architecture Inline data routing Processing and memory scales with capacity Distributed hash table Linear performance scalability Global deduplication across ALL nodes Page 18

Page 19 Linear Performance Scalability

Resiliency Enhanced data resiliency, especially with fewer copies due to deduplication The greater the dedupe ratio, the greater the exposure Maximum protection against multiple failures Component failures Appliance or node failures Upgradeability and technology refresh In-place technology refresh vs. forklift upgrade Online versus downtime Configurable resiliency for different applications Investment protection Page 20

Erasure-coded Resiliency User dialable disk/node protection Intermix of multiple resiliency levels Dynamically allocated protection Greater protection with less overhead No idle spare drives Faster healing while maximizing I/O performance Only data is reconstructed Reconstruction across multiple spindles/processors Page 21

Long-term Evolution Functionality Application-specific requirements (Security, Resiliency) Regulatory compliance requirements Performance Linear scalability throughput HW/SW enhancements Multiple I/O profiles Attributes Flexibility Efficiency Granularity Page 22

Storage Spending Long-term Data Protection Clones/Snapshots WAN-Optimized Replication Erasure-Coded Resiliency Inline Global Deduplication Automatic Load Balancing Dynamic Provisioning Massive Scalability Features drive savings Months Years Decades Centuries Time Page 23