THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY
INTRODUCTION Driven by the need to remain competitive and differentiate themselves, organizations are undergoing digital transformations and becoming increasingly data driven. This has led to a proliferation of modern applications like IoT and Customer 360 built on massively scalable NoSQL data platforms, including Couchbase. It has also created a critical data protection gap, leaving organizations exposed to data loss, unprotected against ransomware attacks, and unable to address compliance requirements. This ebook discusses the key data protection challenges and provides guidance on what to look for in a solution.
UNDERSTANDING COUCHBASE BACKUP & RECOVERY REQUIREMENTS A successful backup and recovery strategy is predicated on addressing several functional requirements:
Full automation requiring no scripting
Incremental-forever backups
Fast and granular point-in-time recovery
Agentless architecture
Massive scalability
Backup storage optimization
Application-aware backups and restores
We will detail each of these requirements in the following chapters.
FULLY AUTOMATED BACKUP No one likes doing backups. And with Big Data platforms, you already get three replicated copies of your data. So why back it up? If all you care about is protection against hardware failures, perhaps replication is good enough. But replication actually compounds more common problems like user errors, application corruption, or ransomware attacks, because a bad write is faithfully replicated to every copy. OK, so backups are worthwhile. Now, database technologists are inherently tech savvy. They can just write scripts to handle the process, right? Not really. There are a lot of considerations:
Database awareness: In addition to the database files, metadata and schema need to be protected.
Complexity: Multiple nodes and replicas are difficult to manage for backup.
Changes: As described later in this ebook, multiple full backups are not practical. Any solution must be able to track net changes.
Copying data: Scripting must copy the data files from all nodes, copy in parallel, handle errors and retries, and manage cloud APIs and storage tiering.
Versioning: Scripts must track historical backups to enable multiple restore points.
Errors: Scripts must handle a wide variety of errors and warnings.
Capacity: Backup storage consumption will be high due to backup of all replicas and periodic full backups.
Performance: Backup and restore performance will be slow and limited to the single server on which the script is running.
Restores: Restores will remain manual, error-prone, and resource intensive.
RPO: With different data requiring different RPOs, scripting for this can be a challenge.
These factors make scripting an incredibly painful and ongoing process. They also significantly increase risk. Unless environments are small and expected to be relatively static, a decision to script should be taken very carefully. In most cases, a 3rd-party solution is going to perform better with less work, lower risk, and lower TCO. In the end, an organization must decide whether it is better to dedicate engineering resources to backup scripting, with all the inherent risks, or to use a proven 3rd-party platform.
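Even the "Errors" consideration alone is more work than it looks. As a minimal sketch (not any particular product's code; `flaky_copy` and the backoff parameters are illustrative), here is the kind of retry-with-backoff wrapper a homegrown script would need around every single transfer:

```python
import time

def copy_with_retries(copy_fn, source, max_attempts=3, base_delay=0.1):
    """Attempt a copy operation, retrying transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return copy_fn(source)
        except OSError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Simulated flaky transfer: fails twice, then succeeds on the third attempt.
attempts = {"n": 0}
def flaky_copy(path):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OSError("transient network error")
    return f"copied {path}"

print(copy_with_retries(flaky_copy, "/data/bucket-0001.couch"))
# -> copied /data/bucket-0001.couch
```

Multiply this by versioning, parallelism, cloud API quirks, and RPO scheduling, and the maintenance burden of scripting becomes clear.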
THE IMPORTANCE OF INCREMENTAL FOREVER Traditionally, organizations backed up data by creating a complete backup each week, followed by daily incremental backups. For recovery, the last full backup became the starting point to which the subsequent incremental backups were added, thus generating the image that needed to be recovered. Consider a Couchbase application with several buckets and a total data size of 20 TB. Implementing a full weekly backup of a 20 TB data set is not feasible and can never meet any reasonable service-level agreement. The only practical way to back up Couchbase and other Big Data databases is with incremental-forever techniques. A full backup is done only once, when the backup workflow is initially set up. After that, only changes are incrementally copied to the destination cluster. For example, when new documents are added to a bucket, only those new documents need be copied to the backup cluster.
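The core of incremental forever is a manifest diff: compare what the backup already holds against the cluster's current state and copy only the difference. A minimal sketch, assuming document IDs and revision numbers as the change-tracking mechanism (the manifest shape here is illustrative, not a real Couchbase format):

```python
def plan_incremental(previous_manifest, current_state):
    """Return only the documents that changed since the last backup.

    Manifests map document ID -> revision; comparing them yields the
    incremental set, so a full copy is needed only once, at setup time.
    """
    return {
        doc_id: rev
        for doc_id, rev in current_state.items()
        if previous_manifest.get(doc_id) != rev
    }

last_backup = {"doc:1": 1, "doc:2": 1, "doc:3": 2}
cluster_now = {"doc:1": 1, "doc:2": 2, "doc:3": 2, "doc:4": 1}

to_copy = plan_incremental(last_backup, cluster_now)
print(to_copy)  # only doc:2 (updated) and doc:4 (new) need to move
```

Against a 20 TB data set, copying only this delta is what makes daily backup windows achievable at all.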
FAST AND GRANULAR RECOVERY IS THE TICKET Another requirement in the Big Data world is that incremental changes must be immediately merged into the full backup to create a complete image of the primary data at a particular point in time. This ensures that recovery can be done in a single step, without the lag associated with assembling a recovery image from a full backup and multiple incremental backup copies. Data recovery must also be application-aware and granular. For a Couchbase database, the backup cluster must be able to recover a single bucket. Additionally, many buckets comprising millions of documents might be backed up in a single workflow; the recovery workflow needs to be flexible enough to restore a single bucket from that backup workflow.
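The two ideas above, a merged point-in-time image and bucket-level granularity, can be sketched in a few lines. This is a toy model (keys are `(bucket, doc_id)` tuples purely for illustration), not how any real backup store lays out data:

```python
def synthesize_full(full_backup, incrementals):
    """Merge a base backup with later incrementals into one point-in-time
    image, so recovery is a single step rather than a multi-copy replay."""
    image = dict(full_backup)
    for inc in incrementals:
        image.update(inc)  # later revisions overwrite earlier ones
    return image

def restore_bucket(image, bucket):
    """Granular recovery: pull back a single bucket from the merged image."""
    return {key: val for key, val in image.items() if key[0] == bucket}

full = {("users", "doc:1"): "v1", ("orders", "doc:9"): "v1"}
incs = [{("users", "doc:1"): "v2"}, {("users", "doc:2"): "v1"}]

image = synthesize_full(full, incs)
print(restore_bucket(image, "users"))  # doc:1 at v2, plus the new doc:2
```

Because the merge happens at backup time, the restore path never pays the cost of replaying incrementals during an outage.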
SPEED YOUR WAY WITH PARALLEL DATA TRANSFERS The architectures of all Big Data platforms, from Couchbase to Impala, specify a loosely coupled, shared-nothing design built on commodity hardware with inexpensive direct-attached storage. Implicit in this design is that data is distributed for storage across several nodes on the primary cluster. Good backup performance, therefore, requires parallel-capable workflows: each node containing data is contacted independently, and its data is copied directly from the individual nodes on the primary cluster hosting it. This requirement suggests that a monolithic backup solution will not scale to Big Data levels because of several chokepoints in its design. Hence, any workable Couchbase backup implementation will have to run on a scale-out platform built on commodity hardware with direct-attached drives. All the nodes in the backup cluster set up connections to all the nodes in the primary cluster so that data can be transferred in parallel.
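The per-node fan-out described above can be sketched with a worker pool: each primary node is contacted independently and its shards are copied concurrently. This is a minimal illustration using Python's standard thread pool; `fetch_from_node` stands in for a real network transfer:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_from_node(node):
    """Stand-in for copying the shards a node owns, directly from that node."""
    return [f"{node}:shard-{i}" for i in range(2)]

def parallel_backup(nodes, workers=4):
    """Contact every primary node independently and copy in parallel,
    avoiding the single-pipe chokepoint of a monolithic backup server."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_node = pool.map(fetch_from_node, nodes)
    return [shard for shards in per_node for shard in shards]

shards = parallel_backup(["node-a", "node-b", "node-c"])
print(len(shards))  # 3 nodes x 2 shards each
```

In a real scale-out backup cluster the same pattern runs on many backup nodes at once, so aggregate throughput grows with both clusters.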
AVOID AGENT OVERHEAD Big Data cluster configurations are in a state of constant flux. Since commodity hardware is used to deploy platforms such as Couchbase, Hadoop, and Vertica, those clusters are configured to withstand or quickly recover from failures of various components such as drives, network adapters, and even entire nodes. A traditional backup solution deploys agents on primary nodes to perform data transfers. That model is not operationally feasible in a Couchbase environment, where new nodes are constantly commissioned and dead nodes decommissioned. Managing and monitoring the availability of agents deployed on individual nodes is a non-trivial task given the number of nodes in a Couchbase cluster. There are also security implications in a datacenter, where authorization from the security infrastructure team is typically required before additional daemons can be deployed on production nodes. For these operational reasons, any Couchbase data protection solution needs an agentless model, with no backup software installed on the nodes of the primary cluster.
LET THE SCALABLE CATALOG DO THE HARD WORK Backing up tens or hundreds of Couchbase clusters is non-trivial. The problem becomes harder as data sets grow to hundreds of terabytes and the number of stored documents reaches the millions or billions. A centralized backup and recovery strategy makes the most sense to ensure operational simplicity and reduce costs. The number of objects that need to be versioned and monitored in the Big Data world is in the millions, and the catalog supporting them has to scale horizontally. For example, a Couchbase data store might easily hold millions of documents. Assuming a typical change rate, every incremental backup adds a large number of objects, and each of these objects must be mapped to an appropriate recovery point. The catalog needs extensive search capabilities and must scale to Big Data levels. Metadata for the objects needs to be stored, and mutations of that metadata must be searchable across different versions and transitions.
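To make the catalog's job concrete, here is a toy in-memory version (a real catalog would be a distributed, horizontally scalable store; the class and its fields are purely illustrative). It tracks every version of every object and supports the regular-expression search described above:

```python
import re

class BackupCatalog:
    """Toy versioned catalog: object name -> list of (version, metadata)."""

    def __init__(self):
        self.entries = {}

    def record(self, name, version, metadata):
        """Map each backed-up object version to its recovery-point metadata."""
        self.entries.setdefault(name, []).append((version, metadata))

    def search(self, pattern):
        """Regex search across object names, returning all known versions."""
        rx = re.compile(pattern)
        return {name: vers for name, vers in self.entries.items() if rx.search(name)}

catalog = BackupCatalog()
catalog.record("users/doc:1", 1, {"recovery_point": "mon"})
catalog.record("users/doc:1", 2, {"recovery_point": "tue"})
catalog.record("orders/doc:7", 1, {"recovery_point": "mon"})

hits = catalog.search(r"^users/")
print(len(hits["users/doc:1"]))  # two versions of the same document
```

At billions of objects, this same logical structure must be sharded across nodes, which is why a single-server catalog becomes the bottleneck.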
BACKUP STORAGE OPTIMIZATION By its very nature, Big Data is, well, big. Fully protecting your Couchbase environment can consume a huge amount of storage. To reduce these costs, organizations should squeeze data into the smallest possible footprint and move it to the lowest-cost storage tier, all without impacting performance. A robust deduplication and compression methodology can reduce storage needs by up to 90%. When you consider the potential size of Couchbase backups, that can lead to huge savings. But in addition to the reduction factor, deduplication and compression need to be high performance, since they take time and compute cycles to complete, during which you are at increased risk. The best footprint optimization methodologies generate the smallest storage requirement in the shortest duration. Minimizing the footprint is only half the process, though. Not all storage cost is created equal. Direct-attached flash costs considerably more than cloud storage, and long-term cloud storage costs less than transactional cloud storage. But with data classification and policies (performance, compliance) often shifting, data placement needs to remain highly dynamic. Ensuring that your data is always on the lowest-cost storage that still meets SLAs is a near impossibility for human and script alike. Some form of automation is needed to fully optimize for cost and risk.
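The deduplication-plus-compression idea can be shown in miniature: hash each block of backup data, store each unique block only once, and compress what is stored. This is a simplified sketch using Python's standard library (real systems use variable-length chunking and far stronger pipelines; the 4 KB blocks here are illustrative):

```python
import hashlib
import zlib

def dedupe_and_compress(blocks):
    """Store each unique block exactly once, keyed by content hash,
    compressing it on the way in."""
    store = {}
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(block)
    return store

# Backups of replicated Big Data contain many identical blocks.
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
store = dedupe_and_compress(blocks)

raw = sum(len(b) for b in blocks)
stored = sum(len(c) for c in store.values())
print(f"{raw} bytes raw -> {stored} bytes stored")
```

Replicas and repeated full-like data are why reduction factors approaching 90% are plausible for backup workloads, though the exact ratio depends entirely on the data.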
APPLICATION-AWARE BACKUPS AND RESTORES The Big Data world involves different applications with different data abstractions. For example, data in Couchbase is stored in buckets and documents, while the abstraction layer for Hive focuses on databases, tables, and partitions. These differences impact backup requirements in several ways. For example, the user setting up workflows needs to interact with the backup system at the data abstraction layer supported by the application. Another requirement is that all the metadata and attributes associated with the abstraction layer also need to be backed up. For example, with Couchbase, we back up and restore bucket properties, view definitions, and index definitions.
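Application awareness means the backup unit carries the abstraction-layer metadata alongside the data. A minimal sketch, assuming a bucket represented as a plain dictionary (the property names `ramQuotaMB` and `replicaNumber` are examples of the kind of settings a bucket carries, not a complete or authoritative list):

```python
def snapshot_bucket(bucket):
    """Capture documents plus application-level metadata as one unit."""
    return {
        "documents": dict(bucket["documents"]),
        "properties": dict(bucket["properties"]),
        "views": list(bucket["views"]),
        "indexes": list(bucket["indexes"]),
    }

def apply_snapshot(image):
    """Restoring re-applies properties, views, and indexes, not just data."""
    return {
        "documents": image["documents"],
        "properties": image["properties"],
        "views": image["views"],
        "indexes": image["indexes"],
    }

bucket = {
    "documents": {"doc:1": {"name": "ada"}},
    "properties": {"ramQuotaMB": 512, "replicaNumber": 1},
    "views": ["by_name"],
    "indexes": ["idx_name"],
}

restored = apply_snapshot(snapshot_bucket(bucket))
print(restored["properties"]["replicaNumber"])  # metadata survives the round trip
```

A filesystem-level copy would capture only the documents; the whole point of application awareness is that the restored bucket behaves like the original, views and indexes included.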
WHY ARE WE PASSIONATE ABOUT THIS? We have witnessed numerous companies feel the pain of losing Big Data, and that inspired us to create Imanis Data. Our software solution is built on a fully scale-out architecture and supports commodity hardware with direct-attached, network-attached, or cloud-attached, software-defined storage. We use an incremental-forever model to fetch only modified objects from the primary cluster. We are completely application-aware: for an application like Couchbase, we pull bucket metadata such as configuration, indexes, and views, and any recovery of a bucket ensures that the metadata associated with it is applied to the recovered bucket. We are also agentless; no Imanis Data software need be installed on any node of the Couchbase cluster. Our catalog is architected to host millions of versioned objects along with their attributes and properties, and it is searchable by different attributes and regular expressions. In addition to backup and recovery, we also support cloud migration, test data management, and enhanced security and compliance. To learn more or to schedule an introductory meeting, please contact info@imanisdata.com.
Imanis Data, Inc. 2860 Zanker Road, Suite 109, San Jose CA 95134 +1.408.649.6338 imanisdata.com © 2018 Imanis Data, Inc. All rights reserved. Imanis Data and the Imanis Data logo are trademarks of Imanis Data in the US and in other countries. Information subject to change without notice. All other trademarks and service marks are property of their respective owners.