Case study: Building bi-directional DR Joep Piscaer, VMware vexpert, VCDX #101
Agenda Introduction Project description, goals, requirements, constraints High level overview: product and component overview Backup and DR Approach and architecture How to improve RTO (Recovery Time Objective) Find your bottlenecks Technical deep dive and live demo Q & A
Agenda Introduction Project description, goals, requirements, constraints High level overview: product and component overview Backup and DR Approach and architecture How to improve RTO (Recovery Time Objective) Find your bottlenecks Technical deep dive and live demo Q & A
Introduction Joep Piscaer Consulting Architect at OGD ict-diensten VMware VCDX5 #101, vexpert 2009, 2011, 2012 Know Veeam since 2007 and in love with them ever since (best. VMworld. parties. ever.)
Agenda Introduction Project goals, requirements and constraints High level overview: product and component overview Backup and DR Approach and architecture How to improve RTO (Recovery Time Objective) Find your bottlenecks Technical deep dive and live demo Q & A
Project Goals Replace current (cloud-based) DR solution RTO is weeks; all data needs to be replicated back to our site RPO is ; current solution doesn t have all data (no OS-disks) Backups are not application-level consistent Develop bi-directional DR for two separate infrastructures We have a internal IT infrastructure but also a cloud service provider infrastructure; both are highly virtualized Eat your own dogfood Use hard- and software we known implement at customers regularly
Project Requirements RTO and RPO need to be improved RTO needs to be reduced from weeks to hours RPO needs to be reduced from to a day Transactionally consistent application state back-ups of mission critical and other VMs Be able to restore individual application and file items but also complete disks, virtual machines and clusters. Backup needs to be available locally for fast application item restores Backup needs to be available remotely for DR-purposes
Project Constraints WAN link too slow to do initial seed During first months of project, WAN capacity was only 20 Mbit/s After project completion, WAN capacity upped to 100 Mbit/s We used two 16x 1TB SuperMicro servers as temporary backup repository; offline initial seed by physically swapping them and mapping jobs to backup files Due to eat your own dogfood, we could only use a very limited set of products, including Veeam B&R
Customer Background 2 infrastructures on separate sites About 25 hosts in 3 clusters About 250 VMs total PXE-booted Citrix XenApp VMs (excluded from backup) Exchange and Zarafa groupware environments Additional scripts required to create consistent snapshots for Zarafa About 12 TB of data About 1000 users overall
Agenda Introduction Project description, goals, requirements, constraints High level overview: product and component overview Backup and DR Approach and architecture How to improve RTO (Recovery Time Objective) Find your bottlenecks Technical deep dive and live demo Q & A
Overview Two Dell PowerEdge R510 servers with 24 GB RAM 12x 3TB nearline SAS disks in RAID-6 for 27 TB backup repository 4 gigabit NICs dedicated to iscsi for direct SAN access Two Veeam v6.0 consoles / installations Distributed backup architecture Multiple proxies Multiple repositories Enterprise Manager installed on one site
Overview We leveraged Application-aware image processing Instant VM Recovery and vpower NFS Virtual Labs SureBackup
Agenda Introduction Project description, goals, requirements, constraints High level overview: product and component overview Backup and DR Approach and architecture How to improve RTO (Recovery Time Objective) Find your bottlenecks Technical deep dive and live demo Q & A
Overall architecture Used Veeam s distributed architecture for proxies and repositories, with a twist We installed two separate stretched backup infrastructures on two hosts so we can restore and continue local backup jobs if the remote site fails Two consoles to separate administrative domains for each IT operations teams Enterprise Manager as a single point of management for licensing and file level restores
Design Choices Each host has one proxy; Each host has two repositories One owned by the local console One owned by the remote console Local SQL Express databases Size of environment didn t require move to full SQL Server Totally independent backup environment required Use Application-aware image processing for consistency Optimize all jobs for WAN replication
Job Type We had no space available on primary SAN storage Replication job type replicates from proxy to proxy, requires standby host and can only store in native (VMX/VMDK) format on VMFS datastores Therefore, we chose regular job type We cannot replicate the local backup. This means each VM would be touched twice every day: once by the job that stores the VM on the local repository once by the job that stores the VM on the remote repository
Deduplication and Compression Compression set to best for all jobs Deduplication happens both at source before data is sent to repository, significantly improving performance and at target to achieve additional reduction for jobs with multiple VMs Block level deduplication optimized for WAN Using 256KB block size instead of default 1024KB size If anything happens to target we can seed the locally stored backups again using the two 16x 1TB SuperMicro servers CBT is designed to handle such usage cases.
Agenda Introduction Project description, goals, requirements, constraints High level overview: product and component overview Backup and DR Approach and architecture How to improve RTO (Recovery Time Objective) Find your bottlenecks Technical deep dive and live demo Q & A
Job Type Advantages of Replication job type: Files stored in native VMware format Restores are parallel No need to choose between reverse and forward incremental No vpower NFS or Instant VM recovery needed VMs run at full I/O speed Number of VMs that can be powered on depends on infrastructure no dependency on backup server No additional migration like Storage vmotion needed after recovery Advanced features like re-ip, failback available
Job Type Pitfalls of using regular jobs: Files stored in Veeam file format; manual interaction required to restore This increases RTO and makes restores sequential Other solutions provide (semi-)parallel restores, keeping RTO down Need to make difficult choice between reversed and forward incremental Instant VM Recovery uses vpower NFS; recovered VMs will not run at full I/O speed affects number of VMs that can be powered on after total site failure Storage vmotion required to complete recovery of each VM No Re-IP and failback and other specific functionality available in Replication job type
Agenda Introduction Project description, goals, requirements, constraints High level overview: product and component overview Backup and DR Approach and architecture How to improve RTO (Recovery Time Objective) Find your bottlenecks Technical deep dive and live demo Q & A
Backup Method We chose reverse incremental : Uses the least amount of disk space to store backups We wanted to maximize retention for jobs stored on local repository. We set retention for jobs stored on remote repository to two restore point as these backups are only for DR-purposes Calculations to produce reverse incremental done on (remote) repository Full backup file is rebuilt every day on remote repository Our physical backup servers have enough oomph to handle 3x I/O load Minimize stress on WAN link; only changed blocks are sent over WAN Last backup is always full No periodic full needed Imagine replicating a full backup of every VM over WAN every week
Why not replicate the local backup? Veeam s distributed architecture supports this use case: Proxy and repository at local site for local backups Proxy at local site and repository at remote site for remote backups No suitable Windows-based tool was found that does block level replication None of the tools integrate with Veeam s proxy architecture and aren t intelligent enough to understand the reverse incremental files to do smart changed block replication These tools take forever to create changed block indexes and begin replication which has a very negative effect on RPO Didn t want to use custom (Powershell) scripts makes your solution harder to manage, upgrade and support
Evaluation of lessons learned Re-evaluate job type: regular or replication Replication job type requires standby host and capacity on SAN We didn t anticipate this and didn t have SAN capacity to spare Regular jobs require touching source VM s twice Wasn t a problem at first, started to become cumbersome as we grew Re-evaluate forward or reverse incremental modes for replication Both have pros and cons for replication; choice is very hard to make We chose to maximize retention and use reverse incremental Re-evaluate Hot Add ( Virtual Appliance ) mode Much faster compared to Direct SAN Acces with thin provisioned disks Restores are much faster compared to Direct SAN Access mode
Agenda Introduction Project description, goals, requirements, constraints High level overview: product and component overview Backup and DR Approach and architecture How to improve RTO (Recovery Time Objective) Find your bottlenecks Technical deep dive and live demo Q & A
Live Demo
Agenda Introduction Project description, goals, requirements, constraints High level overview: product and component overview Backup and DR Approach and architecture How to improve RTO (Recovery Time Objective) Find your bottlenecks Technical deep dive and live demo Q & A
Q & A