MAPR TECHNOLOGIES, INC. TECHNICAL BRIEF APRIL 2017 MAPR SNAPSHOTS

Similar documents
MAPR DATA GOVERNANCE WITHOUT COMPROMISE

Distributed File Systems II

THE COMPLETE GUIDE HADOOP BACKUP & RECOVERY

Remove complexity in protecting your virtual infrastructure with. IBM Spectrum Protect Plus. Data availability made easy. Overview

StoreVault and NetApp Snapshot Technology

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

CLOUD-SCALE FILE SYSTEMS

Cohesity Flash Protect for Pure FlashBlade: Simple, Scalable Data Protection

WHITE PAPER: ENTERPRISE SOLUTIONS

Protecting Microsoft SQL Server databases using IBM Spectrum Protect Plus. Version 1.0

Data Protection Using Premium Features

Data Storage Infrastructure at Facebook

Server Fault Protection with NetApp Data ONTAP Edge-T

How CloudEndure Disaster Recovery Works

How CloudEndure Disaster Recovery Works

How CloudEndure Works

Chapter 2 CommVault Data Management Concepts

Part 1: Indexes for Big Data

IBM Data Replication for Big Data

Tivoli Storage Manager for Virtual Environments: Data Protection for VMware Solution Design Considerations IBM Redbooks Solution Guide

Data Sheet: Storage Management Veritas Storage Foundation for Oracle RAC from Symantec Manageability and availability for Oracle RAC databases

How CloudEndure Works

IBM Spectrum Protect Version Introduction to Data Protection Solutions IBM

The Google File System

MapReduce. U of Toronto, 2014

A Thorough Introduction to 64-Bit Aggregates

MapR Enterprise Hadoop

Combine Native SQL Flexibility with SAP HANA Platform Performance and Tools

THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY

EsgynDB Enterprise 2.0 Platform Reference Architecture

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Guide to Mitigating Risk in Industrial Automation with Database

Distributed Systems 16. Distributed File Systems II

Chase Wu New Jersey Institute of Technology

Protecting Hyper-V Environments

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

EMC DATA DOMAIN OPERATING SYSTEM

Data Protection for Virtualized Environments

Hadoop and HDFS Overview. Madhu Ankam

Data Domain OpenStorage Primer

Continuous Data Protection

BigData and Map Reduce VITMAC03

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

IBM Tivoli Storage Manager Version Introduction to Data Protection Solutions IBM

IBM InfoSphere Information Analyzer

GFS: The Google File System. Dr. Yingwu Zhu

Configuring and Deploying Hadoop Cluster Deployment Templates

HPE 3PAR File Persona on HPE 3PAR StoreServ Storage with Veritas Enterprise Vault

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1

ECE Engineering Robust Server Software. Spring 2018

Dell EMC Isilon with Cohesity DataProtect

Informatica Enterprise Information Catalog

CA ARCserve Backup. Benefits. Overview. The CA Advantage

The VERITAS VERTEX Initiative. The Future of Data Protection

Abstract. The Challenges. The Solution: Veritas Velocity. ESG Lab Review Copy Data Management with Veritas Velocity

Lambda Architecture for Batch and Stream Processing. October 2018

ZYNSTRA TECHNICAL BRIEFING NOTE

BrightStor ARCserve Backup for Windows

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Dell Storage vsphere Web Client Plugin. Version 4.0 Administrator s Guide

GFS: The Google File System

Oracle Zero Data Loss Recovery Appliance (ZDLRA)

EMC VNXe Series. Configuring Hosts to Access NFS File Systems. Version 3.1 P/N REV. 03

Oracle Database 18c and Autonomous Database

A Thorough Introduction to 64-Bit Aggregates

EMC VNXe3200 Unified Snapshots

Cloudera Exam CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Version: 7.5 [ Total Questions: 97 ]

Veritas Storage Foundation for Oracle RAC from Symantec

Protecting Miscrosoft Hyper-V Environments

Veritas InfoScale Enterprise for Oracle Real Application Clusters (RAC)

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into

Microsoft SharePoint Server 2013 Plan, Configure & Manage

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.

Backup and Archiving for Office 365. White Paper

WHITE PAPER. The General Data Protection Regulation: What Title It Means and How SAS Data Management Can Help

Efficient, fast and reliable backup and recovery solutions featuring IBM ProtecTIER deduplication

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs

Linux Filesystems Ext2, Ext3. Nafisa Kazi

Nutanix Tech Note. Virtualizing Microsoft Applications on Web-Scale Infrastructure

WITH INTEL TECHNOLOGIES

Bigtable. Presenter: Yijun Hou, Yixiao Peng

Security and Performance advances with Oracle Big Data SQL

Technical Note. Dell/EMC Solutions for Microsoft SQL Server 2005 Always On Technologies. Abstract

Merging Enterprise Applications with Docker* Container Technology

IBM Spectrum Protect Plus

Flash Storage Complementing a Data Lake for Real-Time Insight

Microsoft SQL Server

Dell SC Series Snapshots and SQL Server Backups Comparison

A Close-up Look at Potential Future Enhancements in Tivoli Storage Manager

Dell EMC Surveillance for Reveal Body- Worn Camera Systems

DATA PROTECTION SOLUTION MICROSOFT SQL SERVER

Provisioning with SUSE Enterprise Storage. Nyers Gábor Trainer &

MarkLogic Technology Briefing

Hybrid Backup & Disaster Recovery. Back Up SAP HANA and SUSE Linux Enterprise Server with SEP sesam

TRIM Integration with Data Protector

IBM FileNet Content Manager and IBM GPFS

Backup Solution Testing on UCS for Small-Medium Range Customers (Disk to Tape)-SAS

VMware vsphere Data Protection 5.8 TECHNICAL OVERVIEW REVISED AUGUST 2014

ELASTIC DATA PLATFORM

Transcription:

MAPR TECHNOLOGIES, INC. TECHNICAL BRIEF APRIL 2017 MAPR SNAPSHOTS

INTRODUCTION The ability to create and manage snapshots is an essential feature expected from enterprise-grade storage systems. This capability is increasingly seen as critical with big data systems as well. A snapshot captures the state of the storage system at an exact point in time and is used for recovering data that was lost or corrupted due to application error, user error, or a malicious event. SNAPSHOT BENEFITS WITH THE MAPR CONVERGED DATA PLATFORM Snapshots within a big data context are useful in both storage and compute. The MapR Converged Enterprise Edition of the MapR Converged Data Platform provides consistent snapshots that offer the following benefits: RECOVERY FROM ERRORS Operational as well as analytical applications manipulate the data in the distributed file system on behalf of the user or the administrator. Application-level errors or even inadvertent user errors can mistakenly delete data or modify data in an unexpected way. In this case, snapshots can be used to restore data quickly and easily. MANAGING REAL-TIME DATA ANALYSIS By using snapshots, query engines like Apache Drill can produce precise SQL query results against data sources subject to constant updates, such as sensor data or social media streams. As new SQL queries are written, either to achieve more specific data results or to improve performance of existing queries, if the underlying data is constantly changing, it is difficult to assess query improvement. Snapshots allow new queries to run against a specific point-in-time image of data to make apples-to-apples comparisons against existing queries. Snapshots ensure the data for both queries is constant, even in a scenario where data is continuously being ingested in real-time. MACHINE LEARNING MODEL TRAINING Machine learning frameworks, such as TensorFlow, can use snapshots to get a known, unchanging, point-in-time view of an otherwise continuously changing data set. These snapshots allow for an auditable model training process, where ongoing updates to machine learning models can be run against a static baseline data set. This helps to identify actual improvements in the model, since the snapshot data remains unchanged. VERIFYING DATA LINEAGE Data lineage helps businesses track the data sources, transformations, and destinations as the data flows through MapR. It is often used to identify the cause of any irregularities in a data set or report. However, to investigate an observed irregularity, auditors have to view the history of the data as it was when the data was created. Snapshots create an immutable view of the data to ensure the lineage cannot be corrupted from modifications. COMPLIANCE AUDITING Organizations often have to demonstrate they were in compliance with required standards over a certain time period. In such a scenario, snapshots can help businesses show an immutable view of data at a certain point in time to prove they were in compliance. MULTIVARIATE AND A/B TESTING AND DATA VERSIONING Snapshots can be used to create immutable views of the results of multivariate and A/B testing. This means the results can be compared and analyzed while being safeguarded against inadvertent or even malicious modifications. 1

ONLINE BACKUPS With snapshots, you can create a baseline for consistent backups. Snapshots can be created when the system is running, thus allowing you to create consistent backups without having to halt data updates in the system. Snapshots also allow you to backup and restore online MapR-DB tables (using the CopyTable functionality). MAPR SNAPSHOTS IMPLEMENTATION In MapR, a snapshot is a read-only image of a volume at a specific point in time. Snapshots are created at the volume level, not the entire cluster level, which means you can isolate them to specific segments of your data sets. You can create a snapshot in an ad hoc manner or automate the process with a schedule. A snapshot takes almost no time to create and uses very little incremental disk space because they are not standard copies of data. They are implemented with a method known as redirect-on-write which is very space-efficient. You can take a snapshot of a 1 petabyte cluster in seconds with no additional data storage required. The redirect-on-write method provides protection without duplicating the data. It uses pointers to keep track of data blocks. If a block needs modification, MapR simply redirects the pointer for that block to another block and writes the new data there (i.e., it redirects on writes). MapR keeps track of all the blocks that comprise a given snapshot. If a process accesses data in a given snapshot, it simply uses these pointers to access those blocks. Here s a brief explanation of how redirect-on-write works: Time T0. File F1 is written. It uses data blocks A,B, and C. Figure 1. Data blocks A, B, and C are mutable 2

Time T1. We create snapshot S1. Data blocks A, B, and C are made immutable and S1 points to A, B, and C. SNAPSHOT S1 Figure 2. Data blocks A, B, and C are made immutable Time T2. We need to make changes to F1. Since MapR-FS supports full read-write capabilities, any part of the file can be updated. In this example, let s say the data in block C needs to be updated. Since block C is immutable, we create a separate block C and add the changes there. Snapshot S1 still points to data blocks A, B, and C. We can restore or view the contents of file F1 as they were at time T1. Current contents of F1 are held in data blocks A, B, and C. SNAPSHOT S1 Figure 3. Data blocks A, B, and C are immutable; C is mutable 3

Time T3. We create snapshot S2. Data block C is made immutable, and snapshot S2 points to A, B, and C. SNAPSHOT S1 SNAPSHOT S2 Figure 4. Data blocks A, B, C, and C are immutable We can now restore or view the contents of file F1 at timestamps T1 and T3 using snapshots S1 and S2, respectively. Note that there is no data duplication of blocks A and B between file F1 and snapshots S1 and S2. Only the data blocks that are changed after a snapshot is taken will require incremental storage space. Another important fact is that in MapR, the snapshots are implemented directly in the storage layer in an efficient and fast way. Any application that saves data in MapR benefits from snapshots out of the box. Moreover, since MapR Snapshots are atomic and consistent, applications have exactly the view of the data at the time that the snapshot was taken. This is not true on HDFSbased data platforms, as will be explained in a section below. WORKING WITH MAPR SNAPSHOTS The following sections describe procedures associated with snapshots: Creating a snapshot (requires MapR Converged Enterprise Edition). You can create a snapshot in an ad hoc manner or use a schedule to automate snapshot creation. Each snapshot created by a schedule has an expiration date that determines how long the snapshot is retained. When you schedule snapshots, the expiration date is determined by the Retain parameter of the schedule. Viewing the contents of a snapshot. At the top level of each volume is a directory called.snapshot containing all the snapshots for the volume. You can view the directory with Hadoop commands (e.g., hadoop fs -ls ) or by mounting the cluster with NFS to view the directory with operating system tools (like the Linux ls command). 4

Viewing a list of snapshots. You can view a list of snapshots for a volume with the volume snapshot list command or with the MapR Control System (MCS). Removing a snapshot. Each snapshot has an expiration date and time, when it is deleted automatically. You can remove a snapshot before its expiration, or you can preserve a snapshot to prevent it from expiring. You can remove a snapshot with the volume snapshot remove command or with MCS. For more information about using the snapshots functionality through the MapR Control System, please refer to the official MapR documentation. HDFS-BASED DATA PLATFORMS AND SNAPSHOTS MapR Snapshots capture data in a precise and consistent state. HDFS and HBase snapshots, in contrast, do not provide consistency and lack many other important capabilities. This means that HDFS and HBase snapshots cannot be trusted for the scenarios described earlier in this document. HDFS SNAPSHOTS HDFS is an append-only file system, so intuitively it appears easy to implement snapshots. However, the separation of data and metadata in HDFS, combined with the NameNode being a bottleneck in the system, makes it difficult or impossible to implement consistent snapshots. As a result, HDFS snapshots took years to implement and are not consistent; hence, they do not work with applications that were not specifically designed to support HDFS snapshots and their limitations. Applications must be made snapshot-aware by calling a new HDFS API that sends up-to-date file length information to the NameNode (SuperSync/SuperFlush). It is hard to design such an application to work correctly, since the use of SuperSync across a cluster can overwhelm the NameNode, causing the entire cluster to fail or causing other processes to come to a halt. Moreover, applications making use of snapshots must avoid modifying files during the creation of the snapshot. HDFS snapshots apply only to the metadata (on the NameNode), so they do not work correctly while files are being written. This happens because the NameNode has difficulty handling even 1000 metadata updates per second. To avoid this, HDFS avoids sending the file length to the NameNode on every hsync/hflush, because that would overwhelm the NameNode. The effect of this implementation is that files that are being written continue to change inside the snapshot, meaning the snapshot is not actually a snapshot; it can contain data that was written after the snapshot was taken or can fail to contain data that was written and flushed before the snapshot was taken. HBASE SNAPSHOTS Because of the storage semantics, HBase snapshots cannot rely on the underlying HDFS snapshots and need to be built separately. This is unlike MapR Snapshots, where there is a common snapshot capability that applies to all data in the cluster. However, HBase snapshots are also not consistent, as each RegionServer snapshots its own data at different times. HBase snapshots exhibit the same problems as HDFS snapshots in terms of containing transactions committed after the snapshot and missing transactions committed before the snapshot. 5

The following table compares MapR Snapshots with HDFS and HBase snapshots: MAPR SNAPSHOTS HDFS SNAPSHOTS HBASE SNAPSHOTS CORRECTNESS AND COMPLETENESS Consistency Yes No No Supported applications All Custom (snapshot-aware, "SuperSync" API should be called) Data loss issues No Recent updates before the snapshot are lost Enabled by default Yes No No Files, Streams, and tables Yes No No in one snapshot Multiple tables in one snapshot Yes N/A No Access data in snapshot directly without recovery process Can be used with non- Hadoop applications HBase only; correct results require application cooperation Snapshots lose data when regions are merged Yes Yes No (need to clone table) Yes No No SCALABILITY AND PERFORMANCE IMPLICATIONS Memory overhead O(1) O(M) O(1) M = Modifications NameNode pressure No Yes (NameNode eventually dies) Yes (NameNode eventually dies) MANAGEMENT Scheduling policies Yes No No Retention policies Yes No No Volumes Yes No No SUMMARY Enterprise data storage solutions have offered the consistent snapshot capability for years, and only MapR offers the same for big data. MapR provides snapshots as first-class citizens of the platform, enabling any kind of application to benefit from it, out of the box and with no special application modifications. MapR Snapshots are proven, having been used by customers in production throughout different verticals since 2011. For more information visit mapr.com MapR and the MapR logo are registered trademarks of MapR and its subsidiaries in the United States and other countries. Other marks and brands may be claimed as the property of others. The product plans, specifications, and descriptions herein are provided for information only and subject to change without notice, and are provided without warranty of any kind, express or implied. Copyright 2017 MapR Technologies, Inc.