
Data Replication Whitepaper WHITEPAPER

Introduction: The Role of Replication in the Enterprise

In our information-driven economy, a strong case can be made that the data stored in a company's storage servers or data center is worth far more than the storage servers on which it resides. Unfortunately, this fact is often not realized until after a disaster such as a power outage, tornado, flood, or fire strikes, and an organization experiences firsthand how the loss of data is even more catastrophic than the material loss of goods. In many industries, the loss of data on a large scale could signal the end of a company in its present form.

A common response to this vulnerability is to accept responsibility for protecting data by copying it and, where necessary, moving it to an offsite repository, to guard against the loss or corruption of the original data. Data duplication, or replication, as it is commonly known, is a very effective safeguard; however, to understand its full value, it is necessary to first understand how replication is used as a means of business continuity. The key point is that there are many modes of replication, each with its strengths and weaknesses, and that the ideal solution lies in their tight integration: combining them, or using each at the appropriate time, to leverage the cost and performance benefits of each.

A typical replication deployment consists of a local server, which acts as the primary data source and fields reads and writes from clients, and a remote server, which acts as a backup repository for all of the information stored on the primary. The link between the primary and secondary servers is established in various ways, depending on distance and permissible cost: it could be a dedicated network link, an optical connection, or a shared corporate network.

Figure 1: A typical replication setup

In many deployments, the dedicated link between the primary and secondary servers accounts for a significant portion of the system's purchase and operating cost, and the expense of writing I/Os to the secondary server often exceeds the expense of writing them to the primary. However, the purpose of the secondary is made painfully clear when the primary server undergoes a catastrophic failure, and the administrator has to restart the company's business from the secondary. This is often done manually, by remounting various volumes from the secondary and restarting affected applications, although many storage and application software stacks can automate this process. Whether manual or automatic, an operation such as this is called a failover.

In failover mode, the secondary server takes over as the recipient of I/Os from clients. The cost of keeping the system running may be higher during failover (due to the greater cost of networking a system to the secondary server), but for many businesses, this cost is far lower than the cost of not doing business for the duration of disaster recovery.

Figure 2: Failover to Secondary after Loss of Primary

When the primary server has been restored (by recovering, repairing, or replacing it), the administrator will most likely want to return control to the primary, for financial reasons if nothing else. This operation is quite similar to failover, and is called failback. The major difference is that failback is done in a disconnected (i.e., offline) fashion, with volumes being reconnected and applications being restarted before I/Os are shipped.

A Brief Overview of RTO and RPO

The value that organizations attach to protecting their data through replication varies widely, typically according to how critical the data is and how much it costs to keep it protected. In response to such a market, the storage industry as a whole promotes several different methods of replication. What they have in common is that each of these methods trades off two critical parameters against cost: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

RTO can be thought of as the time that elapses between the failure of a primary site and the assumption of control by the secondary through failover. Most companies can withstand downtime of a few minutes, while others, such as banks or other financial institutions, cannot afford to be down for a second.

In the same light, RPO can be thought of as the amount of data loss that can be tolerated, measured in units of time. When a primary storage server fails, the RPO is the gap between the point at which the last data was saved to the primary storage server and the point at which the last data was saved to the secondary storage server. In situations such as source code control, data loss of a few minutes is acceptable, while for banking or airline reservation transactions, as is the case with RTO, even one second's worth of data loss can cause irreparable damage because of the number of transactions that occur within that second.

Figure 3: Replication, disaster, recovery, RPO and RTO
[Timeline labels: replication starts; data replicated to secondary; time lag between primary and secondary (RPO), beyond which data has not reached the secondary; point of disaster; recovery period (RTO); secondary takes over.]

Obviously, the ideal is to keep both of these benchmarks as close to zero as possible, but naturally, faster recovery times and a smaller time gap between the primary and backup data mean an increase in the cost of deploying such a system.

Balancing RPO and RTO with Cost, Performance, and Data Protection

As stated above, a major factor in the choice between the major modes of replication, besides the costs of their implementation and operation, is the RPO and RTO that each offers. As detailed below, these two characteristics can differ greatly across the three major types of replication.

Fault Tolerant I/O and High Availability

Generally speaking, the costliest form of replication, with an RPO and RTO of zero, is active-active clustering, or mirroring. In this form of replication, both the primary and the secondary servers are active and functioning at the same time; clients connect to both servers, and the servers take care of maintaining consistency with each other at all times. When a server fails, it is equivalent to the failure of a redundant component, leaving the system untroubled. The second server seamlessly takes over the entire functionality of the system without necessitating a manual failover. This same functionality can be incorporated into a single enclosure with dual controllers, saving physical space and reducing the carbon footprint of an organization's data center.

Figure 4: Active/active mirroring or clustering
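Before turning to the remaining modes, the RPO and RTO definitions given earlier can be made concrete with a small calculation. The following is a minimal Python sketch, not part of any product: the function names and the timestamps are hypothetical and chosen purely for illustration. It measures the data-loss window (the achieved RPO) and the downtime (the achieved RTO) for a single simulated disaster.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_primary_write: datetime,
                 last_replicated_write: datetime) -> timedelta:
    """Data-loss window: writes newer than the last replicated write
    never reached the secondary. Never negative."""
    return max(last_primary_write - last_replicated_write, timedelta(0))


def achieved_rto(disaster: datetime, secondary_online: datetime) -> timedelta:
    """Downtime: from the point of disaster until the secondary takes over."""
    return secondary_online - disaster


# Hypothetical timeline for one incident (illustrative values only).
disaster = datetime(2024, 1, 1, 12, 0, 0)
last_primary_write = disaster - timedelta(seconds=2)      # last write the primary accepted
last_replicated_write = disaster - timedelta(minutes=5)   # last write the secondary holds
secondary_online = disaster + timedelta(minutes=40)       # failover completed

rpo = achieved_rpo(last_primary_write, last_replicated_write)
rto = achieved_rto(disaster, secondary_online)
print("Data-loss window (RPO):", rpo)
print("Downtime (RTO):", rto)
```

A deployment meets its objectives when the measured values stay at or below the RPO and RTO targets the business has set. A mirrored configuration aims to drive both toward zero, and the modes that follow relax one or both in exchange for lower cost.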

Synchronous Replication

Next in the list is synchronous replication. In this replication method, only the primary server fields I/Os from clients. Every write that arrives from a client is also mirrored to the secondary server, and the write is signaled as complete only when it has completed on both servers. In this manner, the applications running on the client machines are always guaranteed to have their writes written to both servers, and if either server fails, the other is guaranteed to contain all the data that has been written so far. The RPO of synchronous replication is, therefore, zero. However, it may be necessary to fail over a synchronous replication setup manually, and the RTO may be on the order of a few hours.

Figure 5: Synchronous replication
[Diagram labels: primary volumes and replica; the primary is responsible for the blue volume and the secondary for the green volume; arrows show the direction in which data is replicated.]

An approach such as this is preferred for applications that require a critical level of data protection, such as market trading and airline reservation management applications, because of its RPO of zero. There is no data loss between the primary and secondary servers, should the application be required to fail over to the secondary.

While synchronous replication is a straightforward approach, it is not trouble-free; some of its larger potential drawbacks are elaborated below.

First, synchronous replication can suffer from write-order fidelity issues if data is cached above it. Data that is held in a write-back cache could be lost when disaster strikes, rendering the replication worthless. To counteract this potential problem, the replication layer must sit above the write-back cache for a SAN volume, so that data is replicated before it is cached, and just below the file system for NAS volumes.

Second, synchronous replication deployments can be expensive, primarily because a high-speed connection is required between the primary and secondary servers in order to avoid a decrease in I/O throughput. In fact, in deployments where the servers are geographically distant, such as between continents or on different coasts, the cost of a dedicated high-speed link may be out of reach.

An additional concern is latency, and its negative impact on server performance in synchronous replication. Since writes must be completed on both the primary and secondary servers before being acknowledged to the initiator, the link can significantly slow the rate at which I/Os complete at the host, and by extension application performance, as illustrated in Figure 6.
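The latency penalty described above can be estimated with a simple model. The sketch below is an illustration under stated assumptions, not a measurement of any product: it assumes the primary issues its local write and the remote write in parallel, that the client keeps only one write outstanding at a time, and that all service times (the `local_ms`, `remote_ms`, and `rtt_ms` parameters) are hypothetical.

```python
def sync_write_latency_ms(local_ms: float, remote_ms: float, rtt_ms: float) -> float:
    """Latency seen by the client for one synchronously replicated write.

    Assumption: the local write and the remote write start together, and the
    write is acknowledged only when both have completed. The remote path costs
    one network round trip plus the secondary's own write time.
    """
    return max(local_ms, rtt_ms + remote_ms)


def unreplicated_write_latency_ms(local_ms: float) -> float:
    """Latency when the client is acknowledged as soon as the primary has written."""
    return local_ms


def serial_writes_per_second(latency_ms: float) -> float:
    """Upper bound on write rate for a client with one write outstanding."""
    return 1000.0 / latency_ms


# Hypothetical figures: 0.5 ms disk service time on each side.
same_campus = sync_write_latency_ms(local_ms=0.5, remote_ms=0.5, rtt_ms=1.0)
cross_country = sync_write_latency_ms(local_ms=0.5, remote_ms=0.5, rtt_ms=70.0)
local_only = unreplicated_write_latency_ms(local_ms=0.5)

for name, latency in [("local only", local_only),
                      ("same campus", same_campus),
                      ("cross country", cross_country)]:
    print(f"{name}: {latency:.1f} ms per write, "
          f"{serial_writes_per_second(latency):.0f} serial writes/s")
```

Under these assumptions the round-trip time of the link quickly dominates the write latency, which is why distance, and not only bandwidth, limits where synchronous replication is practical.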

Figure 6: Latency and performance of synchronous replication

Finally, performance is often decreased further in implementations of synchronous replication when the secondary server is implementing snapshots for data backup. Because typical implementations of snapshots are slow, performance can be degraded by a factor of as much as twenty when snapshots are active, unless countermeasures are taken.

Asynchronous Replication

For applications that are not mission critical, and can tolerate a slight lag between writes to the primary and the secondary for the sake of improved bandwidth usage, asynchronous replication can be a better choice. In this method of replication, I/Os are not sent from the primary to the secondary server as they arrive from clients; rather, they are buffered at the primary for a brief interval before being sent to the secondary.

This buffering of replicated data improves bandwidth utilization and yields lower costs. Since buffered data can be compressed or otherwise optimized for size, the asynchronous replication link can be significantly slower than the link required for synchronous replication. Because of the lag, duplicate writes to the same block are only sent once, again reducing bandwidth use. The trade-off, however, is in the RPO: if disaster strikes, any open buffers on the primary side are necessarily lost, and the secondary, by virtue of lagging behind the primary, will exhibit this brief data loss to clients.

Figure 7: Asynchronous replication

Asynchronous replication increases I/O performance on the primary, because latency is substantially reduced. The problem of write-order fidelity, however, becomes a larger issue than it is in synchronous replication. Since the initiator is not directly in control of the order in which I/Os are sent to the secondary, it can no longer verify that dependent writes are flushed to the replica in the correct order. The role of the replication engine therefore becomes more important, since it is now the responsibility of the storage server to make sure that any application running on the initiator is able to recover smoothly from the secondary server if the primary becomes disabled.

In many contemporary implementations of asynchronous replication, write-order fidelity is managed by collecting I/Os arriving at the primary and sending them to the secondary server in exactly the same order, without any kind of framing or buffering. This guarantees write-order fidelity, but it forgoes any potential performance or bandwidth gains that asynchronous replication might bring.

Snapshot-assisted Replication (SAR)

Snapshot-assisted replication can be considered a variation on asynchronous replication. The main difference between the two is that instead of relying on collected buffers to minimize the amount of data transferred, snapshot-assisted replication relies on snapshots, or point-in-time images of a volume, to establish checkpoints from which to transfer data to the secondary server. A snapshot represents the sum total of all changes since the previous snapshot, eliminating all duplicate writes in the intervening period. One major advantage of using a snapshot for replication is that it is a fixed target: once a snapshot has been taken, it never changes. Therefore, it can be copied to a remote server at leisure, without having to deal with locking or consistency issues.

Figure 8: Snapshot-assisted replication

The RPO for this kind of replication, however, depends on the frequency with which snapshots are taken. For this reason, it is properly called periodic replication rather than continuous replication, which is the term used to refer to the preceding three forms.
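The way a snapshot collapses duplicate writes into a fixed target can be sketched in a few lines. The Python class below is a hypothetical illustration (the name `TrackedVolume` and its methods are assumptions, not an actual appliance interface): it remembers which blocks were written since the previous snapshot and, when a snapshot is taken, freezes only the final contents of those blocks.

```python
class TrackedVolume:
    """Toy volume that tracks blocks changed since the last snapshot."""

    def __init__(self):
        self.blocks = {}       # block number -> current contents
        self._dirty = set()    # block numbers written since the last snapshot
        self.writes_seen = 0   # total client writes, duplicates included

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data
        self._dirty.add(block_no)
        self.writes_seen += 1

    def take_snapshot(self) -> dict:
        """Return the changes since the previous snapshot as a separate,
        fixed mapping. Each changed block appears once, with its final value."""
        delta = {n: self.blocks[n] for n in sorted(self._dirty)}
        self._dirty.clear()
        return delta


vol = TrackedVolume()
vol.write(7, b"v1")
vol.write(7, b"v2")
vol.write(7, b"v3")   # three writes to the same block
vol.write(9, b"x")
writes_before_snapshot = vol.writes_seen

delta = vol.take_snapshot()

# Writes that arrive after the snapshot do not alter it.
vol.write(7, b"v4")

print(f"{writes_before_snapshot} client writes, {len(delta)} blocks to replicate")
```

Because the returned delta no longer changes, it can be shipped to the secondary at whatever pace the link allows while the volume continues to accept writes. The interval between calls to `take_snapshot` is what determines the RPO in this scheme.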

The RTO, however, may often be significantly lower than that of synchronous and asynchronous replication. Because snapshots are application-consistent point-in-time images of volumes, they can be mounted and accessed independently of the volume once they are taken. When a snapshot has been replicated (as volume writes) to the secondary server, a snapshot is taken on the secondary to stamp the point of application consistency. The primary snapshot from which this replica was created can now be deleted. If there is a need to perform failover and restart the application server from the secondary, all that needs to be done is to roll back on the secondary to the latest application-consistent snapshot, and the application will be guaranteed to work.

Some storage appliances and software stacks perform snapshot-assisted replication using delta-mounted snapshots, a mode of mounting that is inherently more powerful in a recovery situation. In this mode, a snapshot is mounted as a sequentially accessible volume that contains only the changes between the snapshot and its predecessor. The data accessible through a delta mount consists of the totality of writes that occurred since the previous snapshot, and replicating these writes alone is sufficient to arrive at the next application-consistent point on the secondary server.

Figure 9: Mounting and Recovery from Delta Snapshot
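The recovery sequence described above (apply the delta as volume writes, stamp a snapshot on the secondary, and roll back to the latest stamp on failover) can be sketched as follows. This is a minimal Python illustration under assumptions: the class `SecondaryVolume`, its methods, and the `fail_after` parameter used to simulate a link loss are all hypothetical.

```python
class SecondaryVolume:
    """Toy secondary that stamps an application-consistent snapshot after
    each fully applied delta and rolls back to the latest stamp on failover."""

    def __init__(self):
        self.blocks = {}
        self.snapshots = []   # list of (label, frozen copy of blocks)

    def apply_delta(self, delta, fail_after=None) -> bool:
        """Apply (block_no, data) pairs in order, as a delta mount would
        present them. fail_after simulates the link dropping mid-transfer."""
        for i, (block_no, data) in enumerate(delta):
            if fail_after is not None and i >= fail_after:
                return False
            self.blocks[block_no] = data
        return True

    def stamp(self, label: str) -> None:
        """Record the current contents as an application-consistent point."""
        self.snapshots.append((label, dict(self.blocks)))

    def failover(self) -> str:
        """Discard any partially applied delta by rolling back to the latest stamp."""
        label, image = self.snapshots[-1]
        self.blocks = dict(image)
        return label


sec = SecondaryVolume()

# First replication cycle completes, so the secondary stamps a snapshot.
if sec.apply_delta([(0, b"hdr-1"), (1, b"rec-A")]):
    sec.stamp("snap-1")

# The second cycle is interrupted after one block: the header is new, the record is old.
completed = sec.apply_delta([(0, b"hdr-2"), (1, b"rec-B")], fail_after=1)
torn_state = dict(sec.blocks)

# Failover returns the secondary to the last consistent point.
restored_to = sec.failover()
print("transfer completed:", completed)
print("restored to:", restored_to, sec.blocks)
```

The rollback is a local operation on the secondary, which is why the time to resume service in this scheme does not depend on how much data was in flight when the primary failed.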

SAR in an Extensible User-Mode Framework

One of the greatest benefits of snapshot-assisted replication is realized when it becomes part of an extensible user-mode framework. Once this is done, it becomes possible to incorporate a variety of features into SAR, and thus integrate it with third-party products with ease. For example, because of its position in the stack, snapshot-assisted replication can be deployed and upgraded without rebooting the machine.

Another extension that snapshot-assisted replication can support is data compression. Since data is read from a snapshot and written to the secondary outside the I/O path, efficient compression algorithms can be applied to greatly reduce the amount of bandwidth needed.

In the same light, in addition to compressing data, it also becomes possible to compress differential data using delta snapshot-assisted replication. To do so, the snapshot-assisted replication module reads not only the most recent snapshot data but also the data from a previous snapshot, and compresses only the difference between the two. By doing this, it is possible to transmit differences to the remote side at byte-level granularity, compressed with highly efficient algorithms. This method comes close to the minimum amount of data that must be exchanged between the primary and secondary servers.

Aside from compression, another useful feature that snapshot-assisted replication provides is the ability to batch data replication, or to schedule it to happen only at particular intervals, so that replication runs when the workload on the servers is lowest. Administrators can use this feature effectively to further decrease the total cost of ownership that replication incurs.

In addition to compression, there are a number of methods to optimize the TCP/IP connections to maximize the throughput of the physical link. This allows organizations to avoid the cost of additional bandwidth and makes for an affordable DR solution.
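The benefit of compressing only the difference between two snapshots can be seen in a short experiment. The sketch below is an assumption-laden illustration, not the method used by any particular product: it uses a byte-wise XOR as the difference between the previous and current contents of a block and Python's standard `zlib` module as the compressor, and the function names are hypothetical.

```python
import random
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers; unchanged bytes become zero."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_delta(prev: bytes, curr: bytes) -> bytes:
    """Compress only what changed between the previous and current snapshot data."""
    return zlib.compress(xor_bytes(prev, curr))


def decode_delta(prev: bytes, payload: bytes) -> bytes:
    """Rebuild the current data on the secondary from its copy of the previous data."""
    return xor_bytes(prev, zlib.decompress(payload))


# A 4 KiB block of hard-to-compress data, with 16 bytes changed between snapshots.
rng = random.Random(0)
prev_block = bytes(rng.getrandbits(8) for _ in range(4096))
changed = bytearray(prev_block)
for i in range(100, 116):
    changed[i] ^= 0xFF          # flip every bit so each of the 16 bytes really differs
curr_block = bytes(changed)

full_payload = zlib.compress(curr_block)               # compress the whole block
delta_payload = encode_delta(prev_block, curr_block)   # compress only the difference

print("whole block, compressed:", len(full_payload), "bytes")
print("delta, compressed:      ", len(delta_payload), "bytes")
```

The secondary can reconstruct the current block only because it already holds the previous snapshot's data, so this approach depends on both sides agreeing on which snapshot the delta was computed against.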

Various new replication configurations become possible due to the flexible implementation of snapshot-assisted replication. For example, one-to-many replication becomes much simpler, meaning that a single volume may be replicated to multiple replica mirrors for even greater data protection. It is also possible to replicate to a storage medium that is not vendor-locked, such as a generic tape backup device or a simple disk array (also known as JBOD, or "just a bunch of disks"). This flexibility empowers an administrator to choose the most appropriate backup scheme for the organization, without being tied to the replication mechanism that is in use.

Conclusion

As should be clear from the discussion above, a wide spectrum of replication solutions exists for deployment in storage networks and data centers, trading off RPO, RTO and cost against each other. An analysis of the major modes of replication, along with their various merits and problem-solving approaches, was presented here to highlight the suitability of each.

StorTrends iTX: Balancing Replication Modes for Optimum Performance

It is likely that the ideal solution lies in a combination of all three, since all of the major modes of replication (mirroring, synchronous, and asynchronous/snapshot-assisted) have their benefits and drawbacks. The ideal storage solution would offer the capability to choose between each type of replication based on an accurate needs assessment, balancing the pros and cons to arrive at an optimal choice. Fortunately, StorTrends from American Megatrends provides exactly such a solution, at a price that is affordable even to small and medium-sized business (SMB) users.
[Chart labels: two panels, Recovery Point Objective (RPO) against cost and Recovery Time Objective (RTO) against cost, each showing Active/Active Clustering, Synchronous Replication, Asynchronous Replication, Snapshot-assisted Replication, and Tape Backup & Offline Replication.]

Figure 10: RPO and RTO of Major Replication Schemes

The innovative features of StorTrends make error handling safer and easier, and manage link and power failure in a decisive manner. This unique software stack has been shown to decrease bandwidth requirements and balance link and CPU utilization, while providing constant high performance. It is also amenable to various value-adds that may be added by third parties to enhance the performance and flexibility of its replication module.

Whether operating on a SAN or NAS StorTrends storage device, data replication in StorTrends includes all three of the primary replication modes (synchronous, asynchronous, and snapshot-assisted) discussed in this paper. StorTrends also offers a form of mirroring in its high availability configuration. This flexibility makes it equally effective for units installed in remote offices as it is for units installed in the same rack in the data center.

StorTrends: Putting It All Together

All these forms of replication are tightly integrated in StorTrends, making it easy to choose a replication mode that best serves a user's needs. Regardless of the type of replication being performed, various operations may be performed simultaneously on both the primary and secondary servers, including the creation and deletion of snapshots, expansion of volume capacity, and even rollbacks and restores.

The result of this tight cohesion of the various replication features of StorTrends is added flexibility and power in the hands of the user. The solid integration that StorTrends applies to its replication technology also has the effect of making replication transparent and invisible to the user until disaster strikes, which, quite simply, is the way it should be.

StorTrends gives administrators the ability to balance cost, performance, and data protection requirements by offering multiple modes of replication to suit a wide variety of needs, based on the proximity, connection speed, and critical nature of the devices and data involved. This dual dialect software (SAN and NAS) is preinstalled on all StorTrends servers and is easy to configure, thanks to its intuitive interface and powerful CLI scripting tools. Additionally, StorTrends provides support for UPS power switchover notification in the event of a power failure.

Figure 11: Balance RPO, RTO and Cost with StorTrends

Why StorTrends?

Consider StorTrends for your data replication solution because StorTrends from American Megatrends, Inc. is Performance Storage with Proven Value. StorTrends SAN and NAS storage appliances are installed worldwide and trusted by companies and institutions in a wide range of industries including education, energy, finance, state & local government, healthcare, manufacturing, marketing, retail, R&D and many more.

StorTrends provides users with the features and tools necessary to meet the challenges and demands of today's business environments by offering key network storage functionality such as unified storage, simplified management, business continuity, disaster recovery, high efficiency and virtualization support.

Since 2001, StorTrends has built a community of over 1,000 satisfied customers thanks to award-winning products, integrated data protection and world-class support, while continuing to enhance its products with patented enterprise-class features. For more information on StorTrends solutions from AMI, visit www.stortrends.com, email sales@ami.com, or call 1-800-U-BUY-AMI.