White Paper: Nexenta Replicast
By Caitlin Bestler, September 2013

Table of Contents

Overview
Nexenta Replicast Description
Send Once, Receive Many
Distributed Storage Basics
Nexenta Replicast Advantages

Overview

Nexenta Replicast is a storage and transport stack that enables efficient and reliable network replication of storage content. This storage software feature was developed by, and is the property of, Nexenta Systems, Inc. The purpose of this white paper is to explain the concepts behind Nexenta Replicast and to describe its functionality.

Nexenta Replicast Description

Nexenta Replicast is a storage ("Replicast Storage") and transport ("Replicast Transport") stack that enables efficient and reliable network replication of storage content. Replicast Transport utilizes lower-layer multicast protocols that provide unreliable datagram service, and is positioned directly on top of Layer 4 in the conventional OSI/ISO 7-layer model.

Figure 1: OSI/ISO 7-layer model with Nexenta Replicast.
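To illustrate what "positioned directly on top of Layer 4" means in practice, the following minimal Python sketch sends one chunk as a sequence of unreliable UDP datagrams to a multicast group, so that every subscribed receiver gets the data from a single transmission. The group address, port, datagram size and TTL are illustrative assumptions only, not actual Replicast protocol constants.

    import socket

    # Illustrative values only -- not actual Replicast protocol constants.
    MCAST_GROUP = "239.1.1.1"   # administratively scoped multicast address (assumption)
    MCAST_PORT = 10000          # example UDP port (assumption)
    DATAGRAM_SIZE = 8192        # example payload bytes per datagram (assumption)

    def send_chunk(chunk: bytes) -> None:
        """Send one storage chunk as a sequence of unreliable UDP datagrams
        addressed to a multicast group (one transmission, many receivers)."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 4)
        try:
            for offset in range(0, len(chunk), DATAGRAM_SIZE):
                sock.sendto(chunk[offset:offset + DATAGRAM_SIZE],
                            (MCAST_GROUP, MCAST_PORT))
        finally:
            sock.close()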

Nexenta Replicast Storage in turn uses its own underlying (Replicast) Transport layer in order to:

- Efficiently replicate content on a Send Once, Receive Many (SORM) basis, which drastically reduces the required network bandwidth
- Use multicast communications to control which specific servers store replicas, thus enabling a new model for controlling distributed storage

Nexenta Replicast radically optimizes network bandwidth while simultaneously avoiding network congestion, and dynamically load-balances replicated data between participating storage servers.

Send Once, Receive Many

Users demand uninterrupted access to their data at all times. Data availability implies a certain degree of redundancy across all system components providing the storage service: drives, servers, and the network. Protection against server and (often) drive failures means that multiple copies of the same data must be placed on multiple storage servers. To achieve that, data is replicated over distance, typically over a TCP/IP intranet and sometimes over a WAN. Data replication consumes a significant and constantly increasing portion of the overall network bandwidth. As the cost of raw storage plummets, the cost of network replication determines how much protection against content loss is affordable. In 2011 a 1 TB 7200 RPM SATA drive cost more than $400, compared with prices below $70 today. The same is happening to SAS and SSD drives. In short, the same number of dollars now buys at least three times as many terabytes as in 2011. But three times as many terabytes of network traffic would cost nearly three times as much.

Assume a given piece of content must be replicated to protect against the simultaneous loss of three replicas. Conventional distributed storage solutions would then require four network replicas, that is, four transmissions of the same content over the network interconnecting the servers that store it. Complex or advanced solutions can reduce that below four replicas, but only very slightly. Nexenta Object Storage (NOST) 1.0 uses local RAID-1/RAID-10 mirroring to implement a "2*2" solution: NOST makes only two network replicas because each network replica is locally RAID-1 protected against the loss of a disk drive. However, these solutions have costs. Erasure-encoding-based solutions implement a custom (typically configurable) N/M schema over multiple storage servers, where a storage system of N servers can withstand the loss of M (M < N) servers. Erasure encoding, however, slows access to content, since a majority of the data slices must be retrieved to restore the original data. Just as an army marches at the speed of its slowest soldier, a cluster using erasure encoding is as slow as its busiest server.
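As a minimal illustration of an N/M erasure schema, the sketch below uses a trivial single-parity (M = 1) layout, chosen for brevity and not representative of any particular product. It shows why a lost slice can only be rebuilt after fetching every surviving slice, which is the read-latency cost described above.

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(chunk: bytes, data_slices: int = 3):
        """Split a chunk into equal data slices plus one XOR parity slice
        (a toy M = 1 schema; real deployments use configurable N/M codes)."""
        assert len(chunk) % data_slices == 0, "toy example assumes an evenly divisible chunk"
        size = len(chunk) // data_slices
        slices = [chunk[i * size:(i + 1) * size] for i in range(data_slices)]
        parity = slices[0]
        for s in slices[1:]:
            parity = xor_bytes(parity, s)
        return slices, parity

    def rebuild_lost_slice(surviving_slices, parity):
        """Rebuilding a lost data slice requires reading all surviving slices,
        so a read is only as fast as the busiest server holding one of them."""
        rebuilt = parity
        for s in surviving_slices:
            rebuilt = xor_bytes(rebuilt, s)
        return rebuilt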

The Nexenta Object Storage "2*2" solution requires the configuration of parallel networks to avoid the risk of reduced availability from the loss of two network links or forwarding elements.

Figure 2: Conventional Replication. Source sends the whole chunk to each Replica.

Figure 3: Erasure Encoding. About half of the payload is sent to each slice.

Figure 4: NOST 1.0 2x2 Solution. Two network replicas that each protect against loss of one drive locally.

While working to make the "2*2" solution easier to deploy, we realized that it still required transmitting the data twice. That's a major improvement over transmitting four times. But why not go for a solution where all four replicas could be created by transmitting just once? Replicast enables exactly that: effectively reliable replication of content to any number of recipients with a single transmission.

Figure 5: Nexenta Replicast. Send once, receive many.
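To make the wire-bandwidth comparison across Figures 2 through 5 concrete, this short sketch computes the bytes placed on the network per stored chunk for each approach. The chunk size and the 6+3 erasure layout are illustrative assumptions; the conventional, "2*2" and Replicast ratios follow directly from the text.

    CHUNK = 128 * 1024 * 1024  # example chunk size: 128 MiB (assumption)

    # Conventional replication: one full copy on the wire per network replica.
    conventional = 4 * CHUNK

    # NOST 1.0 "2*2": two network replicas, each RAID-1 mirrored locally.
    nost_2x2 = 2 * CHUNK

    # Erasure encoding, illustrative 6 data + 3 parity layout: every slice is
    # sent once, so wire traffic is (data + parity) / data times the chunk.
    data_slices, parity_slices = 6, 3
    erasure = CHUNK * (data_slices + parity_slices) / data_slices

    # Replicast SORM: one multicast transmission regardless of replica count.
    replicast = 1 * CHUNK

    for name, nbytes in [("conventional", conventional), ("2*2", nost_2x2),
                         ("erasure 6+3", erasure), ("Replicast", replicast)]:
        print(f"{name:>12}: {nbytes / CHUNK:.1f}x chunk size on the wire")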

Each chunk is transmitted as a sequence of datagrams. Accurate reception is verified with a hash signature of the entire chunk, which radically reduces the number of packets required to verify accurate reception. The cost is that when a retransmission is needed, the entire chunk must be retransmitted. But this is acceptable when virtually all errors are caused by congestion drops rather than by actual transmission errors. When replicating storage chunks, a check against data corruption far better than a network checksum is required anyway: when you are storing petabytes, letting 1 error in 64K go undetected simply will not work, and 1 in 2^32 isn't much better.

Nexenta Object Storage (NOST) is based on Nexenta's Cloud Copy on Write (CCOW) technology, which uses a cryptographic hash algorithm (such as SHA-256, SHA-512 or SHA-3) to protect against data corruption and forgery. Lesser hash algorithms may be used when only protection from data corruption is required. When specialized offload instructions and/or parallel processing solutions (such as GPUs) can perform the hashing, there is minimal benefit to using a cheaper algorithm; however, an algorithm such as MurmurHash is suitable when pre-image attacks are not a concern.

Nexenta Replicast relies on reservations issued by the receivers to avoid congestion on the destination links. Congestion on the inbound links is confined to the buffering within the originating host. Nexenta Replicast assumes that a maximum bandwidth can be stated for each port, and that this bandwidth can be carried through the core network to any destination in a non-blocking fashion. So if each Nexenta Replicast transmitter confines itself to the provisioned maximum bandwidth, congestion is avoided, because the destinations only issue reservations up to their provisioned input capacity.

Making four replicas of a chunk with a single transmission is a major breakthrough, but it turns out that the techniques that enable making the reservations to support rendezvous transmissions have even greater benefits for distributed storage.

Distributed Storage Basics

A distributed storage system breaks each object down into a root (which can be version specific) and zero or more payload chunks. The root is identified by the object name (and version). The chunks have arbitrary identifiers that are referenced in the root. They can be derived from the payload or merely generated as some form of global serial number. Nexenta Object Storage identifies chunks by the cryptographic hash of the compressed payload. OpenStack Swift does not chunk objects. The HDFS namenode generates arbitrary identifiers for its chunks ("blocks").
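The following sketch illustrates the content-addressed identification just described for Nexenta Object Storage: a chunk ID derived as the cryptographic hash of the compressed payload, so a receiver can verify an entire chunk with one hash comparison rather than per-packet checksums. The choice of zlib for compression is an assumption made for the example; the white paper does not name a compression algorithm.

    import hashlib
    import zlib

    def make_chunk(payload: bytes):
        """Compress the payload and derive the chunk ID as the SHA-256 hash
        of the compressed bytes (CCOW-style content addressing)."""
        compressed = zlib.compress(payload)
        return compressed, hashlib.sha256(compressed).hexdigest()

    def verify_chunk(compressed: bytes, expected_id: str) -> bool:
        """A receiver confirms accurate reception of the whole chunk with a
        single hash comparison; any corruption forces a full-chunk retransmit."""
        return hashlib.sha256(compressed).hexdigest() == expected_id

    compressed, chunk_id = make_chunk(b"example chunk payload " * 4096)
    assert verify_chunk(compressed, chunk_id)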

Figure 6: A distributed storage system.

Both the version roots ("Version Manifests" for Nexenta Object Storage) and the chunks must be distributed to multiple storage servers to protect against the loss of specific servers. The fundamental issues for any distributed storage system are:

- How are the storage servers to hold a given Version Manifest or Chunk selected?
- How do you find them when you come back later looking for them?

Traditionally there have been two solutions:

1. A central metadata server makes the choices, and then remembers what it chose. This is how pNFS and HDFS work.
2. A consistent hashing algorithm picks the set of servers for any given Manifest/Chunk based on an agreed-upon hash of its identifier.
   - If two parties agree on the set of servers the resources should be distributed over, they will always assign resource X to the same set of servers.
   - A keep-alive ring is responsible for reaching consensus on "the set of servers" that is connected.

Nexenta Replicast enables a novel third solution:

3. A small group of servers uses multicast datagrams to dynamically select the storage servers. Instead of looking up N servers in a hash table, you look up a multicast group which contains M servers (probably 10 to 20). Those servers dynamically select the N that will store the chunk. When it is time to retrieve the chunk, multicast datagrams are used to find the least busy server amongst the M (10 to 20) servers in the multicast group. A sketch of this negotiation follows below.
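The sketch below is a highly simplified model of that negotiation, under the assumption that each server in the Negotiating Group answers a multicast put proposal with a "bid" describing how busy it is; the actual Replicast message formats, timers and selection criteria are not specified here.

    import random
    from dataclasses import dataclass

    @dataclass
    class Bid:
        server_id: str
        queue_depth: int   # stand-in for "how busy" a server is (assumption)

    def select_targets(bids, n):
        """Pick the N least-busy servers out of the M bids returned by the
        Negotiating Group (M is typically 10 to 20 servers per the text)."""
        return [b.server_id for b in sorted(bids, key=lambda b: b.queue_depth)[:n]]

    # Example: a put proposal multicast to a group of M = 12 servers returns
    # 12 bids; the 4 least-busy servers are selected to store the chunk.
    bids = [Bid(f"server-{i}", random.randint(0, 50)) for i in range(12)]
    print(select_targets(bids, n=4))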

Selecting the best N of M servers dramatically reduces the time it takes to process each put or get. Making a dynamic selection from a small group (M) of servers provides virtually all the benefit of selecting from a larger group, at a fraction of the cost. Picking just N on an arbitrary/random basis can never achieve high utilization of IOP capacity. Centralized control has its own cost: enforcing its hierarchy places sharp limits on scalability. Consistent hashing, like all forms of hashing, suffers a heavy effectiveness fall-off once utilization of raw capacity goes over 50%. Nexenta Replicast enables higher utilization, meaning ultimately that fewer servers can provide the same number of utilized IOPs as could be achieved with consistent hashing, and without the cliff effects inherent in centralized control.

Nexenta Replicast Advantages

Nexenta Replicast replaces the conventional Distributed Hash Table (DHT; see for instance http://en.wikipedia.org/wiki/distributed_hash_table) with a collaboratively edited Hash Allocation Table. At a minimum, this allows the distribution of chunks across servers to be dynamically adjusted when a default hash algorithm results in a lop-sided, unbalanced distribution of chunks for a specific storage cluster.

Nexenta Replicast also allows storage servers to act as volunteers, to collaboratively optimize storage access latency, remaining free capacity, server utilization and/or any other meaningful, configurable management policy. While not counted towards the required membership, volunteer servers can join specific Negotiating Groups to provide storage services for a specific subset of chunks. Volunteer servers using fast storage (RAM and/or SSD) will cache the most frequently (and/or most recently) accessed content to optimize IOPs, thus improving overall cluster efficiency. Nexenta Object Storage (NOST) enabled unlimited caching for Object Storage, but could only do so for in-line local caches. Nexenta Replicast enables free-roaming caches.

Nexenta Replicast is fundamentally a method of enabling efficient collaboration within a storage cluster. All of its algorithms are intrinsically distributed. They are also heterogeneous: new heuristics can be deployed to achieve special objectives.

Nexenta is a registered trademark of Nexenta Systems Inc., in the United States and other countries. All other trademarks, service marks and company names mentioned in this document are properties of their respective owners.

Copyright 2014 Nexenta. All Rights Reserved.