Decentralized Distributed Storage System for Big Data

Presenter: Wei Xie, Data-Intensive Scalable Computing Laboratory (DISCL), Computer Science Department, Texas Tech University

Outline
- Trends in Big Data and cloud storage
- Decentralized storage techniques
- UniStore project at Texas Tech

Big Data Storage Requirements
- Large capacity: hundreds of terabytes of data and more
- Performance-intensive: demanding big data analytics applications, real-time response
- Data protection: protect hundreds of terabytes of data from loss

Why Data Warehousing Fails in Big Data
- Data warehousing has been used to process very large data sets for decades
- A core component of Business Intelligence
- Not designed to handle unstructured data (emails, log files, social media, etc.)
- Not designed for real-time, fast response

Comparison
- Traditional data warehousing problem: retrieve the sales figures of a particular item in a chain of retail stores, where the figures already exist in a database
- Big data problem: cross-reference sales of a particular item with weather conditions at the time of sale, or with various customer details, and retrieve that information quickly

Big Data Storage Trends: Scale-out storage
- A number of compute/storage elements connected via a network
- Capacity and performance can be added incrementally
- Not limited by the RAID controller

Big Data Storage Trends: Scale-out NAS
- NAS: network-attached storage
- Scale-out offers more flexible capacity/performance expansion (add NAS nodes instead of adding disks to the slots of one NAS)
- A parallel/distributed file system (e.g. Hadoop) handles the scale-out NAS
- Examples: EMC Isilon, Hitachi Data Systems, DataDirect Networks hScaler, IBM SONAS, HP X9000, and NetApp Data ONTAP

Big Data Storage Trends: Object storage
- Flat namespace instead of the hierarchical namespace of a file system
- Objects are identified by IDs
- Better scalability and performance for very large numbers of objects
- Example: Amazon S3

Big Data Storage Trends: Hyperscale architecture
- Mainly used for large infrastructure sites by Facebook, Google, Microsoft and Amazon
- Scaled-out DAS: direct-attached storage, i.e. commodity enterprise servers with storage devices attached
- Redundancy: fail over an entire server instead of individual components
- Hadoop runs on top of a cluster of DAS to support big data analytics
- Part of the Software Defined Storage platform

Hyper-converged Storage
- Compute, network, storage and virtualization are tightly integrated
- Buy a hardware box and get all you need
- Examples: VMware, Nutanix, Nimboxx

Scale-out Storage: Centralized vs. Decentralized
A centralized storage cluster:
- Metadata server, storage servers and interconnections
- Scalability is bounded by the metadata server
- Multi-site distributed storage?
- Redundancy achieved by RAID
A decentralized storage cluster:
- No metadata server to limit scalability
- Multi-site, geographically distributed
- Data replicated across servers, racks or sites

Decentralized Storage
How to distribute data across nodes/servers/disks?
- P2P-based protocol
- Distributed hash table
Advantages:
- Incremental scalability: build a small cluster and expand it in the future
- Self-organizing
- Redundancy
Issues:
- Data migration upon data center expansion and failures
- Handling heterogeneous servers

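Decentralized Storage: Consistent Hashing
[Figure: two hash rings on which servers and data items D1 to D3 are positioned by the SHA-1 function. First ring (three servers): server 1 holds D1, server 2 holds D2, server 3 holds D3. Second ring (four servers): server 4 holds D1, server 2 holds D2, server 3 holds D3, server 1 holds nothing.]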
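The ring in the figure can be sketched in a few lines of Python. This is a minimal illustration, not the code of any system named in the talk: the class name `HashRing`, the helper `sha1_position` and the server and key names are assumptions made for the example. Each server and each key is hashed with SHA-1 to a point on the ring, and a key belongs to the first server at or after its point, wrapping around at the end.

```python
import bisect
import hashlib


def sha1_position(name: str) -> int:
    """Map a string to a point on the ring (a 160-bit integer) using SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical minimal ring: one point per server, successor lookup."""

    def __init__(self, servers=()):
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        point = sha1_position(server)
        bisect.insort(self._points, point)
        self._owner[point] = server

    def remove(self, server: str) -> None:
        point = sha1_position(server)
        self._points.remove(point)
        del self._owner[point]

    def lookup(self, key: str) -> str:
        """Return the first server at or after the key's point, wrapping around."""
        i = bisect.bisect_left(self._points, sha1_position(key))
        return self._owner[self._points[i % len(self._points)]]


keys = [f"D{i}" for i in range(1, 101)]
ring = HashRing(["server1", "server2", "server3"])
before = {k: ring.lookup(k) for k in keys}

ring.add("server4")
after = {k: ring.lookup(k) for k in keys}

# Only keys that now fall just before server4's point change owner.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner after adding server4")
```

Every key that changes owner moves to the new server, and no key moves between the existing servers, which is what keeps expansion incremental.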

Properties of Consistent Hashing
- Balance: each server owns a roughly equal portion of the keys
- Smoothness: when the k-th server is added, about a 1/k fraction of the keys (those located between it and its predecessor server) must be migrated
- Fault tolerance: each key has multiple copies; if one server goes down, the next successor is used, with only a small change to the cluster view, and balance still holds
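These properties can be checked empirically with a small sketch. Everything here is illustrative and assumed for the example: the helper names `build_ring` and `replicas`, the use of 100 virtual points per server (a common way to even out balance, not something stated in the talk), and the replica count of 3.

```python
import bisect
import hashlib


def h(s: str) -> int:
    """SHA-1 of a string as an integer ring position."""
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest(), "big")


def build_ring(servers, vnodes=100):
    """Sorted list of (position, server); each server gets `vnodes` points."""
    return sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))


def replicas(ring, key, n=3):
    """Walk clockwise from the key and collect the first n distinct servers."""
    start = bisect.bisect_left(ring, (h(key), ""))
    found = []
    for step in range(len(ring)):
        server = ring[(start + step) % len(ring)][1]
        if server not in found:
            found.append(server)
            if len(found) == n:
                break
    return found


servers = ["server1", "server2", "server3", "server4"]
keys = [f"object-{i}" for i in range(2000)]
ring4 = build_ring(servers)

# Smoothness: add a 5th server and count the keys whose primary owner changes.
ring5 = build_ring(servers + ["server5"])
moved = [k for k in keys if replicas(ring4, k, 1) != replicas(ring5, k, 1)]
print(f"fraction of keys moved: {len(moved) / len(keys):.2f} (theory: about 1/5)")

# Fault tolerance: drop server2; the surviving replicas keep their order and
# the next successor on the ring fills the gap.
ring_failed = build_ring([s for s in servers if s != "server2"])
print(replicas(ring4, "object-0"), "->", replicas(ring_failed, "object-0"))
```

The measured fraction depends on the hash values, so it only approximates 1/k; what is guaranteed is that every moved key lands on the new server.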

UniStore Overview
- Goal: build a unified storage architecture (UniStore) for cloud storage systems in which heterogeneous HDDs and SCM (Storage Class Memory) devices co-exist and are efficiently integrated
- Based on Sheepdog, a decentralized storage system built on consistent hashing
[Figure: the Characterization Component (workloads: access patterns; devices: bandwidth, throughput, block erasure, concurrency, wear-leveling) guides the Placement Component (I/O pattern: random/sequential, read/write, hot/cold; I/O functions: Write_to_SSD, Read_from_SSD, Write_to_HDD; placement algorithm: modified consistent hash).]

Background: Heterogeneous Storage
Heterogeneous storage environment with distinct throughput:
- NVMe SSD: 2000 MB/s or more
- SATA SSD: ~500 MB/s
- Enterprise HDD: ~150 MB/s
Large SSDs are becoming available, but are still expensive:
- 1.2TB NVMe Intel 750 costs $1000
- 1TB SATA Samsung 640 EVO costs $500
- 10x or more costly than HDDs
SSDs still co-exist with HDDs as accelerators instead of replacing them

Background: How to Use SSDs in Cloud-scale Storage
Traditional way of using SCMs (e.g. SSDs) in cloud-scale distributed storage: as a cache layer
- Caching/buffering generates extensive writes to the SSD, which wears out the device
- Needs a fine-tuned caching/buffering scheme
- Does not fully utilize the capacity of SSDs, which is growing fast
Tiered storage
- Data is placed on SSD or HDD servers according to requirements: throughput, latency, access frequency
- Data is transferred between tiers when the requirements change
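The tiering idea can be illustrated with a deliberately simple policy. This is a sketch under stated assumptions, not a scheme from the talk: the functions `assign_tiers` and `migrations`, the greedy hottest-first rule and the capacity figures are all hypothetical.

```python
def assign_tiers(access_counts, sizes_gb, ssd_capacity_gb):
    """Greedy tiering: the hottest objects go to SSD until it is full."""
    placement = {}
    free = ssd_capacity_gb
    # Hottest first; ties broken by name so the result is deterministic.
    for obj in sorted(access_counts, key=lambda o: (-access_counts[o], o)):
        if sizes_gb[obj] <= free:
            placement[obj] = "ssd"
            free -= sizes_gb[obj]
        else:
            placement[obj] = "hdd"
    return placement


def migrations(old, new):
    """Objects whose tier changed between two placement rounds."""
    return {o: (old[o], new[o]) for o in new if o in old and old[o] != new[o]}


sizes = {"a": 10, "b": 10, "c": 10}

# Round 1: "a" and "c" are hot, and the SSD tier has room for two objects.
round1 = assign_tiers({"a": 100, "b": 5, "c": 50}, sizes, ssd_capacity_gb=20)

# Round 2: "b" becomes the hottest object, so it is promoted and "c" is demoted.
round2 = assign_tiers({"a": 100, "b": 200, "c": 50}, sizes, ssd_capacity_gb=20)

print(round1)
print(migrations(round1, round2))
```

Unlike a cache, each object lives on exactly one tier here, so the SSD capacity is used for storage rather than for duplicate copies; the price is the migration traffic reported by `migrations`.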

Tiered-CRUSH
- CRUSH ensures that data is placed across multiple independent locations to improve data availability
- Tiered-CRUSH integrates storage tiering into CRUSH data placement

Tiered-CRUSH
- The virtualized volumes have different access patterns
- The access frequency of objects is recorded per volume; hotter data is more likely to be placed on faster tiers
- Fair storage utilization is maintained
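The phrase "hotter data more likely to be placed on faster tiers" can be made concrete with a weighted hashing sketch. To be clear about what is assumed: this is not the Tiered-CRUSH algorithm and does not use the CRUSH library. It swaps in weighted rendezvous hashing, a simpler technique, and the weighting rule, the `hotness` scale from 0 to 1 and the device names are hypothetical. The device figures follow the evaluation setup reported in this talk.

```python
import hashlib
import math

# name: (capacity in GB, read bandwidth in MB/s)
DEVICES = {
    "nvme-0": (128, 2000),
    "sata-0": (256, 540),
    "sata-1": (256, 540),
    "hdd-0": (1000, 156),
    "hdd-1": (1000, 156),
    "hdd-2": (1000, 156),
}


def unit_hash(obj: str, dev: str) -> float:
    """Deterministic pseudo-random number in (0, 1] for an (object, device) pair."""
    digest = hashlib.sha1(f"{obj}|{dev}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def weight(dev: str, hotness: float) -> float:
    """Assumed rule: capacity-weighted when cold, bandwidth-weighted when hot."""
    capacity, bandwidth = DEVICES[dev]
    return capacity ** (1.0 - hotness) * bandwidth ** hotness


def place(obj: str, hotness: float, replicas: int = 2):
    """Pick `replicas` distinct devices; a larger weight wins more often."""
    ranked = sorted(
        DEVICES,
        key=lambda d: math.log(unit_hash(obj, d)) / weight(d, hotness),
        reverse=True,
    )
    return ranked[:replicas]


objects = [f"obj-{i}" for i in range(2000)]
hot_share = sum(place(o, 1.0)[0] == "nvme-0" for o in objects) / len(objects)
cold_share = sum(place(o, 0.0)[0] == "nvme-0" for o in objects) / len(objects)
print(f"primary copies on NVMe: hot={hot_share:.2f}, cold={cold_share:.2f}")
```

In this sketch cold objects spread in proportion to capacity, while hot objects favor the high-bandwidth devices. Each replica still lands on a distinct device, and placement needs no central lookup table because it is recomputed from the hash.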

Tiered-CRUSH: Evaluation
- Implemented in a benchmark tool compiled with the CRUSH library functions
- Simulation showed that data distribution uniformity can be maintained
- Simulation showed a 1.5x to 2x improvement in overall bandwidth in our experimental settings

Device name        Number  Capacity (GB)  Read bandwidth (MB/s)
Samsung NVMe SSD   1       128            2000
Samsung SATA SSD   2       256            540
Seagate HDD        3       1000           156

Pattern-directed Replication
- Trace object I/O requests when applications are executed for the first time
- Trace analysis, correlation finding and object grouping
- Reorganize objects for replication in the background
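The trace-analysis step can be sketched as follows. This is one simple way to find correlated objects, assumed for illustration rather than taken from the talk: objects that appear together inside a sliding window at least `min_count` times are merged into one group with a union-find structure. The function names, the window size and the threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(trace, window=2):
    """Count how often two objects appear within the same sliding window."""
    counts = Counter()
    for i in range(len(trace) - window + 1):
        distinct = sorted(set(trace[i:i + window]))
        counts.update(combinations(distinct, 2))
    return counts


def group_objects(trace, window=2, min_count=2):
    """Union objects whose co-access count reaches min_count; return the groups."""
    parent = {obj: obj for obj in trace}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in co_access_counts(trace, window).items():
        if count >= min_count:
            parent[find(a)] = find(b)

    groups = {}
    for obj in parent:
        groups.setdefault(find(obj), set()).add(obj)
    return sorted(sorted(g) for g in groups.values())


# A recorded sequence of object requests from a first run of an application.
trace = ["a", "b", "a", "b", "x", "c", "d", "c", "d"]
groups = group_objects(trace)
print(groups)
```

Objects in the same group are candidates for being stored together in a replica, so that a later run with the same access pattern can fetch them from one place.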

Version Consistent Hashing Scheme
- Build versions into consistent hashing
- Avoid data migration when nodes are added or fail
- Maintain efficient data lookup
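The slide gives no algorithmic detail, so the following is only one possible reading of "building versions into consistent hashing", not the authors' scheme. The assumption is that every membership change creates a new ring version, existing data stays where its original version placed it, and lookups probe versions from newest to oldest. The class `VersionedRing` and all names in it are hypothetical.

```python
import bisect
import hashlib


def pos(s: str) -> int:
    """SHA-1 of a string as an integer ring position."""
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest(), "big")


class VersionedRing:
    def __init__(self, servers):
        self.versions = [self._build(servers)]  # version 0
        self.stored = {}  # server -> {key: value}; stands in for real disks

    @staticmethod
    def _build(servers):
        return sorted((pos(s), s) for s in servers)

    @staticmethod
    def _owner(ring, key):
        i = bisect.bisect_left(ring, (pos(key), ""))
        return ring[i % len(ring)][1]

    def add_server(self, server):
        """Record a new ring version; nothing already stored is moved."""
        current = [s for _, s in self.versions[-1]]
        self.versions.append(self._build(current + [server]))

    def put(self, key, value):
        """New writes are placed with the newest ring version."""
        server = self._owner(self.versions[-1], key)
        self.stored.setdefault(server, {})[key] = value
        return server

    def get(self, key):
        """Probe ring versions from newest to oldest until the key is found."""
        for ring in reversed(self.versions):
            server = self._owner(ring, key)
            if key in self.stored.get(server, {}):
                return server, self.stored[server][key]
        raise KeyError(key)


store = VersionedRing(["server1", "server2", "server3"])
keys = [f"object-{i}" for i in range(200)]
placed = {k: store.put(k, f"payload-{i}") for i, k in enumerate(keys)}

store.add_server("server4")  # creates version 1 without migrating data
still_in_place = all(store.get(k)[0] == placed[k] for k in keys)
print("all objects readable from their original server:", still_in_place)
```

The trade-off in this sketch is that a lookup may need one probe per version, so a real design would have to bound the number of live versions or record the version with each object to keep lookups efficient.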

Conclusions
- Decentralized storage is becoming the standard in cloud storage
- The Tiered-CRUSH algorithm achieves better I/O performance and higher data availability at the same time for heterogeneous storage systems
- The version consistent hashing scheme improves the manageability of the data center
- PRS (pattern-directed replication) provides high-performance data replication by reorganizing the placement of data replicas

Thank you! Questions? Visit: discl.cs.ttu.edu for more details