Why Scale-Out Big Data Apps Need a New Scale-Out Storage
Modern storage for modern business
Rob Whiteley, VP, Marketing, Hedvig
April 9, 2015
Agenda
- Big data pressures on storage infrastructure
- The rise of elastic software-defined storage (SDS)
- 6 SDS capabilities for big data
- 3 case studies of SDS for big data
Copyright 2015 Hedvig Inc. Confidential.
Big data pressures on storage infrastructure
Big data requires flexible infrastructure
- Business executives demand big data.
- Developers demand time-to-market.
- IT infrastructure & DevOps must deliver flexible infrastructure.
According to Forrester...
- Enterprise data is growing 10x faster than storage budgets.
- 58% of orgs take days, weeks, or months to provision storage.
- 14% of orgs have cloud-like provisioning capabilities.
Source: Forrester Technology Adoption Profile: Meet Evolving Business Demands With Software-Defined Storage, March 2015. Visit hedviginc.com for the full research report.
Three truths and a lie about storage & big data
1. Software-defined storage is the right direction.
2. Hyperconverged provides the best economics.
3. Big data apps are repeating the sins of the '90s.
4. Hyperscale helps virtualize Hadoop and NoSQL.
The rise of elastic software-defined storage (SDS)
A big data inflection point in storage
The big data software storage inflection point: price/performance improves as architectures evolve from traditional to scale-up, to scale-out, to elastic.
Storage capabilities, before and after:
- Hardware-defined → Software-defined
- Scale-out → Elastic
- High-availability + RAID → Distributed + Replication
- Hyperconverged → Hyperconverged + Hyperscale
Three legs to the big data requirements stool
Software-defined storage rests on three legs:
- Storage flavors: monolithic arrays, virtual SANs, hyperconverged
- Storage features
- Deployment flexibility
Hyperscale vs. hyperconverged
- Hyperscale: the Hadoop/NoSQL cluster (Linux, Windows, and hypervisor hosts, each running a storage client) is separate from a dedicated cluster of storage nodes.
- Hyperconverged: Hadoop/NoSQL and the storage cluster run together on the same nodes.
How SDS provides elastic storage for big data
1. Admin provisions virtual volumes and scripts or applies storage policies.
2. Virtual volume presents block (iSCSI), file (NFS), and object storage to big data hosts.
3. Storage client captures guest I/O and communicates with the underlying cluster.
4. Cluster distributes and replicates data, and applies compression & dedupe.
5. Cluster auto-tiers & balances to optimize data locality & availability.
Each node is an x86 or ARM server; the cluster can span data centers (DC1, DC2) and cloud (Cloud3).
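The provisioning step can be sketched in code. This is a hypothetical illustration, not a real Hedvig API: the `Policy`, `VirtualVolume`, and `provision` names are invented here to show how a volume plus policy might be modeled.

```python
# Hypothetical sketch of step 1: provisioning a virtual volume with an
# attached storage policy. All names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Policy:
    replication_factor: int = 3   # number of data copies
    dc_aware: bool = False        # place one copy per data center?
    compression: bool = True
    dedupe: bool = True

@dataclass
class VirtualVolume:
    name: str
    size_gb: int
    protocol: str                 # "iscsi", "nfs", or "object"
    policy: Policy

def provision(name, size_gb, protocol, **policy_opts):
    """Create a virtual volume that a big data host can mount."""
    if protocol not in ("iscsi", "nfs", "object"):
        raise ValueError(f"unsupported protocol: {protocol}")
    return VirtualVolume(name, size_gb, protocol, Policy(**policy_opts))

# A volume for a Cassandra node, presented as block storage over iSCSI.
vol = provision("cassandra-data", 500, "iscsi", replication_factor=3)
```

The point of the sketch is that the policy travels with the volume, so the cluster can enforce replication, compression, and dedupe per volume rather than per array.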
6 SDS capabilities for big data
6 big data friendly SDS capabilities
1. I/O sequentialization
2. Tunable replication
3. DR replication
4. Disk failures and rebuilds
5. Data efficiency methods
6. Flash caching & flash pinning
1. Random I/O to sequential writes
- The application writes data in random blocks and gets an immediate ack from the cluster.
- The storage cluster sequentializes incoming blocks (in RAM+SSD) into larger chunks.
- The storage cluster writes the larger, sequentialized chunks to the underlying disks in an auto-balanced, auto-distributed manner, according to policy.
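The sequentialization step above can be sketched as a staging buffer: random small writes are acknowledged immediately, accumulated in memory, and flushed to the backing store as one large sequential chunk. The 64 KB chunk size and the in-memory list are illustrative stand-ins for the RAM+SSD staging area; this is a minimal sketch, not Hedvig's implementation.

```python
# Minimal sketch of I/O sequentialization: random 4 KB writes are acked
# immediately, buffered, and flushed as one large sequential chunk.
CHUNK_SIZE = 64 * 1024  # flush threshold in bytes; illustrative value

class SequentializingBuffer:
    def __init__(self, backing_store):
        self.backing = backing_store   # list standing in for disk
        self.pending = []
        self.pending_bytes = 0

    def write(self, block: bytes) -> str:
        self.pending.append(block)
        self.pending_bytes += len(block)
        if self.pending_bytes >= CHUNK_SIZE:
            self.flush()
        return "ack"                   # the app gets an immediate ack

    def flush(self):
        if self.pending:
            # One large sequential write instead of many random ones.
            self.backing.append(b"".join(self.pending))
            self.pending, self.pending_bytes = [], 0

store = []
buf = SequentializingBuffer(store)
for _ in range(16):                    # sixteen random 4 KB writes...
    buf.write(b"x" * 4096)
# ...land on the backing store as a single 64 KB sequential chunk.
```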
Example: single write operation (policy: 3 copies, datacenter-agnostic)
- The application sends the write to any storage cluster node (round-robin).
- That cluster node writes the first aggregated blocks locally (SSD/flash, then SAS/SATA); the second copy is written to the first responding cluster node.
- An ack is sent back to the big data node after a majority quorum of acks (two acks in the case of 3 copies). All copies are checksummed.
- The third copy is written semi-synchronously; it could also be synchronous if all servers are equidistant.
(Hedvig Controller software runs client-side; Hedvig Cluster software runs on the storage nodes.)
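The quorum rule in this example is simple majority math, sketched below. Replica acks are passed in as booleans for illustration; in a real cluster they would arrive asynchronously over the network.

```python
# Sketch of the majority-quorum ack: with 3 copies, the client is
# acknowledged once 2 replicas confirm, while the third copy completes
# semi-synchronously. Only the quorum arithmetic is modeled here.
def quorum_write(replica_acks, copies=3):
    """Return True once a majority of `copies` replicas have acked."""
    needed = copies // 2 + 1   # majority quorum: 2 acks for 3 copies
    return sum(replica_acks) >= needed

print(quorum_write([True, True, False]))   # → True  (2 of 3 acked)
print(quorum_write([True, False, False]))  # → False (still waiting)
```

This is why the slide says the ack goes back after "two acks in the case of 3 copies": the write is durable on a majority even though one copy is still in flight.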
2. Granular replication of data
Data is divided into granular chunks that are distributed across all servers and storage containers (and their underlying disk platters) in the storage cluster.
(Hedvig Controller software runs client-side on the big data node; Hedvig Cluster software runs on the storage nodes.)
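Chunk-level distribution can be sketched with a hash-based placement function: each chunk ID maps to a set of distinct containers, so copies spread across the whole cluster rather than mirroring whole disks. The modular placement rule below is an illustrative assumption, not Hedvig's actual algorithm.

```python
# Sketch of granular chunk placement: hash the chunk ID, then pick
# `copies` distinct containers starting from that position.
import hashlib

def place_chunk(chunk_id: str, containers: list, copies: int = 3):
    """Map a chunk to `copies` distinct storage containers."""
    h = int(hashlib.sha256(chunk_id.encode()).hexdigest(), 16)
    start = h % len(containers)
    return [containers[(start + i) % len(containers)] for i in range(copies)]

containers = [f"container-{i}" for i in range(9)]
targets = place_chunk("vol1-chunk-0042", containers)
# Three distinct containers hold the three copies of this chunk.
```

Because placement is deterministic per chunk, any node can recompute where a chunk's copies live without consulting a central map.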
3. DR policy: 3x DC-aware with 3 copies
- DR policy: datacenter-aware (one copy per DC)
- Data copies: 3
- Sync acknowledgements: 2
Data centers A, B, and C all remain active-active.
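The datacenter-aware placement above can be sketched as "pick one node per DC until the copy count is met." Choosing the first node in each DC is a simplification of whatever load-aware selection a real cluster would make; the function and its names are illustrative.

```python
# Sketch of a DC-aware policy: 3 copies, one per data center, so losing
# any single DC leaves two live copies (and the 2-ack quorum intact).
def dc_aware_placement(nodes_by_dc: dict, copies: int = 3):
    """Choose one node in each data center, up to `copies` total."""
    if len(nodes_by_dc) < copies:
        raise ValueError("DC-aware policy needs at least one DC per copy")
    # One representative node per DC, in stable (sorted) DC order.
    return [nodes[0] for _dc, nodes in sorted(nodes_by_dc.items())][:copies]

cluster = {
    "dc-a": ["a1", "a2"],
    "dc-b": ["b1", "b2"],
    "dc-c": ["c1"],
}
replicas = dc_aware_placement(cluster)  # one replica per data center
```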
4. Disk failures and rebuilds
- Disks are managed in protection groups.
- Disk rebuilds are initiated automatically upon disk failure, across the entire cluster.
- No spare disks are needed.
- Quick, wide-striped rebuilds allow for the largest disks: the average 4TB disk rebuild time is under 20 minutes, easily supporting 6TB, 8TB, and 10TB drives.
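A back-of-envelope calculation shows why wide-striped rebuilds can be this fast: every surviving disk contributes in parallel. The 50 MB/s per-disk rebuild rate and 80 participating disks below are illustrative assumptions, not figures from the deck; with them, a 4 TB rebuild comes in under the 20-minute claim.

```python
# Rough wide-stripe rebuild estimate: aggregate throughput scales with
# the number of disks participating, so rebuild time stays flat-ish as
# drives grow. Rates and disk counts are assumed, illustrative values.
def rebuild_minutes(disk_tb: float, peers: int, mb_per_sec_per_peer: float = 50.0):
    total_mb = disk_tb * 1_000_000           # TB -> MB (decimal units)
    aggregate = peers * mb_per_sec_per_peer  # cluster-wide rebuild rate
    return total_mb / aggregate / 60

# 4 TB rebuilt by 80 disks at 50 MB/s each: 4,000,000 / 4,000 = 1000 s.
print(round(rebuild_minutes(4, 80), 1))  # → 16.7
```

Contrast this with a RAID rebuild, where a single spare disk's write speed is the bottleneck regardless of cluster size.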
5. Thin provisioning, deduplication, and compression
- Thin provisioning for every virtual volume.
- Inline compression and deduplication.
- Global, system-wide deduplication: all attached storage nodes participate.
- 60-75% data reduction (dedupe rates vary based on data type).
- The dedupe cache can reside on Controller SSD/flash in the application server, eliminating duplicate I/O from the network to dramatically lower latency and increase IOPS.
- A non-deduped volume can be cloned with dedupe enabled.
- Client-side SSD/flash dedupe read cache (with dedupe map) sits on the big data node with the storage client; a cluster SSD/flash read+write cache sits on the storage nodes.
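Inline, global dedupe can be sketched as a content-addressed store: blocks are indexed by their content hash, so a duplicate payload is stored (and need not be transmitted) only once. The dict below plays the role of the system-wide dedupe map; it is a toy model, not the actual implementation.

```python
# Sketch of inline deduplication: identical blocks collapse to one
# physical copy, keyed by content hash.
import hashlib

class DedupeStore:
    def __init__(self):
        self.blocks = {}          # content hash -> data, stored once
        self.logical_writes = 0

    def write(self, data: bytes) -> str:
        self.logical_writes += 1
        digest = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(digest, data)   # duplicates are skipped
        return digest                          # volume stores only the ref

store = DedupeStore()
for block in [b"aaa", b"bbb", b"aaa", b"aaa"]:
    store.write(block)
# 4 logical writes, but only 2 unique blocks physically stored.
reduction = 1 - len(store.blocks) / store.logical_writes  # 50% here
```

The same hash map, cached client-side in flash, is what lets a storage client answer duplicate reads locally instead of crossing the network.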
6. Three ways Hedvig uses SSD and PCIe flash
1. Client-side read and dedupe cache on the big data node (with the storage client).
2. Read/write cache on the storage nodes.
3. Primary storage as a dedicated volume on the storage nodes, with flash pinning.
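Flash pinning can be sketched as a cache whose pinned entries are exempt from eviction, so a hot working set (say, an index volume) stays resident in flash while ordinary entries cycle through LRU. The tiny capacity and LRU policy below are toy assumptions for illustration.

```python
# Sketch of a flash read cache with pinning: pinned keys are never
# evicted; everything else follows LRU order.
from collections import OrderedDict

class FlashCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()   # key -> data, in LRU order
        self.pinned = set()          # keys exempt from eviction

    def pin(self, key):
        self.pinned.add(key)

    def put(self, key, data):
        self.cache[key] = data
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            # Evict the least recently used *unpinned* entry, if any.
            victim = next((k for k in self.cache if k not in self.pinned), None)
            if victim is None:
                break                # everything left is pinned
            del self.cache[victim]

    def get(self, key):
        return self.cache.get(key)

cache = FlashCache(capacity=2)
cache.pin("hot-index")
cache.put("hot-index", b"idx")
cache.put("a", b"1")
cache.put("b", b"2")   # evicts "a", never the pinned "hot-index"
```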
3 case studies of SDS for big data
Three case studies

Fortune 100 bank
- Goal: deploying Cassandra and MongoDB for developers, with infrastructure self-provisioning for a DevOps model.
- Challenge: multiple NoSQL deployments led to islands of (elastic) storage and an inability to self-provision or plug into the bank's orchestration tools.
- Outcome: building an elastic SDS cluster on commodity infrastructure to lower cost per bit and drive self-provisioning through RESTful APIs.

Fortune 50 telecom
- Goal: centralized, shared storage to virtualize three Hadoop distributions: Hortonworks, Cloudera, and MapR.
- Challenge: multiple Hadoop deployments led to islands of (also elastic) storage and blocked IT's virtualization-first policy.
- Outcome: virtualizing all three Hadoop distributions and deploying SDS as the data lake for scale-out storage and global dedupe across data sets.

4th largest US law firm
- Goal: quick, reliable indexing of 100M active client docs in HP Autonomy.
- Challenge: needed a scale-out, flash-friendly solution to replace the local SSDs required to achieve sub-second index queries.
- Outcome: 6x performance with SDS versus a traditional hybrid array (which included a flash tier), now with incremental commodity scalability.
Thank you!