Understanding high availability with WebSphere MQ


Mark Hiscock, Software Engineer, IBM Hursley Park Lab, United Kingdom
Simon Gormley, Software Engineer, IBM Hursley Park Lab, United Kingdom

May 11, 2005

Copyright International Business Machines Corporation. All rights reserved.

This whitepaper explains how you can easily configure and achieve high availability using IBM's enterprise messaging product, WebSphere MQ V5.3 and later. This paper is intended for:

o Systems architects who make design and purchase decisions for the IT infrastructure and may need to broaden their designs to incorporate HA.
o System administrators who wish to implement and configure HA for their WebSphere MQ environment.

Table of Contents

1. Introduction
2. High availability
3. Implementing high availability with WebSphere MQ
   3.1. General WebSphere MQ recovery techniques
   3.2. Standby machine - shared disks
        HA clustering software
        When to use standby machine - shared disks
        When not to use standby machine - shared disks
        HA clustering active-standby configuration
        HA clustering active-active configuration
        HA clustering benefits
   3.3. z/OS high availability options
        Shared queues (z/OS only)
   3.4. WebSphere MQ queue manager clusters
        Extending the standby machine - shared disk approach
        When to use HA WebSphere MQ queue manager clusters

        When not to use HA WebSphere MQ queue manager clusters
        Considerations for implementation of HA WebSphere MQ queue manager clusters
   3.5. HA capable client applications
        When to use HA capable client applications
        When not to use HA capable client applications
Considerations for WebSphere MQ restart performance
        Long running transactions
        Persistent message use
        Automation
        File systems
Comparison of generic versus specific failover technology
Conclusion
Appendix A: Available SupportPacs
Resources
About the authors

1. Introduction

With an ever-increasing dependence on IT infrastructure to perform critical business processes, the availability of this infrastructure is becoming more important. The failure of an IT infrastructure results in large financial losses, which increase with the length of the outage [5]. The solution to this problem is careful planning to ensure that the IT system is resilient to any hardware, software, local, or system-wide failure. This capability is termed resilience computing, which addresses the following topics:

o High availability
o Fault tolerance
o Disaster recovery
o Scalability
o Reliability
o Workload balancing and stress

This whitepaper addresses the most fundamental concept of resilience computing, high availability (HA). That is, "An application environment is highly available if it possesses the ability to recover automatically within a prescribed minimal outage window" [7]. Therefore, an IT infrastructure that recovers from a software or hardware failure, and continues to process existing and new requests, is highly available.

2. High availability

The HA nature of an IT system is its ability to withstand software or hardware failures so that it is available as much of the time as possible. Ideally, despite any failure which may occur, this would be 100% of the time. However, there are factors, both planned and unplanned, which prohibit this from being a reality for most production IT infrastructures. These factors lead to the unavailability of the infrastructure, meaning the availability (per year) can be measured as the percentage of the year for which the system was available. For example:

Figure 1. Number of 9's availability per year

Availability %    Downtime per year
99%               3.65 days
99.9%             8.76 hours
99.99%            52.6 minutes
99.999%           5.26 minutes
99.9999%          31.5 seconds

Figure 1 shows that a 30-second outage per year is called six 9's availability because of the percentage of the year the system was available. Factors that cause a system outage, and reduce the number of 9's of uptime, fall into two categories: those that are planned and those that are unplanned. Planned disruptions are either systems management (upgrading software or applying patches) or data management (backup, retrieval, or reorganization of data). Conversely, unplanned disruptions are system failures (hardware or software failures) or data failures (data loss or corruption). Maximizing the availability of an IT system means minimizing the impact of these failures on the system. The primary method is the removal of any single point of failure (SPOF) so that, should a component fail, a redundant or backup component is ready to take over. Also, to ensure enterprise messaging solutions are made highly available, the software's state and data must be preserved in the event of a failure and made available again as soon as possible. The preservation and restoration of this data removes it as a single point of failure in the system. Some messaging solutions remove single points of failure, and make software state and data available, by using replication technologies. These may be in the form of asynchronous or synchronous replication of data between instances of the software in a network. However, these approaches are not ideal, as asynchronous replication can cause duplicated or lost data and synchronous replication incurs a significant performance cost because data is being backed up in real time.

It is for these reasons that WebSphere MQ does not use replication technologies to achieve high availability. The next section describes methods for making a WebSphere MQ queue manager highly available. Each method describes a technique for HA and when you should and should not consider it as a solution.
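The downtime figures in Figure 1 follow directly from the availability percentage: the permitted downtime per year is simply the unavailable fraction of the year multiplied by the number of seconds in a year. The short Java sketch below illustrates the calculation (the class name is illustrative):

public class DowntimeCalculator {
    public static void main(String[] args) {
        // Availability percentages from Figure 1.
        double[] availabilities = {99.0, 99.9, 99.99, 99.999, 99.9999};
        double secondsPerYear = 365.0 * 24 * 60 * 60;
        for (double availability : availabilities) {
            // Permitted downtime is the unavailable fraction of the year.
            double downtimeSeconds = (1.0 - availability / 100.0) * secondsPerYear;
            System.out.printf("%.4f%% available -> about %.0f seconds (%.1f minutes) of downtime per year%n",
                    availability, downtimeSeconds, downtimeSeconds / 60.0);
        }
    }
}

For example, 99.9999% availability permits roughly 31.5 seconds of downtime per year, which matches the six 9's figure above.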

3. Implementing high availability with WebSphere MQ

This section discusses the various methods of implementing high availability in WebSphere MQ. Examples show when you can or cannot use HA. Standby machine - shared disks and z/OS high availability options describe HA techniques for distributed and z/OS queue managers, respectively. WebSphere MQ queue manager clusters describes a technique available to queue managers on all platforms. HA capable client applications describes a client-side technique applicable on all platforms. By reading each section, you can select the best HA methodology for your scenario. This paper uses the following terminology:

Machine: A computer running an operating system.
Queue manager: A WebSphere MQ queue manager that contains queue and log data.
Server: A machine that runs a queue manager and other third-party services.
Private message queues: Queues owned by a particular queue manager and only accessible, via WebSphere MQ applications, when the owning queue manager is running. These queues are to be contrasted with shared message queues (explained below), which are a particular type of queue only available on z/OS.
Shared message queues: Queues that reside in a Coupling Facility and are accessible by a number of queue managers that are part of a Queue Sharing Group. These are only available on z/OS and are discussed later.

3.1. General WebSphere MQ recovery techniques

On all platforms, WebSphere MQ uses the same general techniques for dealing with recovery of private message queues after a failure of a queue manager. With the exception of shared message queues (see Shared queues), messages are cached in memory and backed by disk storage if the volume of message data exceeds the available memory cache. When persistent messaging is used, WebSphere MQ logs messages to disk storage. Therefore, in the event of a failure, the combination of the message data on disk plus the queue manager logs can be used to reconstruct the message queues. This restores the queue manager to a consistent state at the time just before the failure occurred. This recovery involves completing normal unit of work resolution, with in-flight messages being rolled back, in-commit messages being completed, and in-doubt messages waiting for coordinator resolution. The following sections describe how the above general restart process is used in conjunction with platform-specific facilities, such as HACMP on AIX or ARM on z/OS, to quickly restore message availability after failures.
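To show where persistence fits into this recovery model, the following sketch puts a message marked as persistent, so that it is hardened to the queue manager logs and can be reconstructed after a failure and restart. It uses the WebSphere MQ base Java classes (package com.ibm.mq) with class and constant names as they appeared in the V5.3-era API; the queue manager name, queue name, and message content are illustrative, and error handling is kept to a minimum.

import com.ibm.mq.MQC;
import com.ibm.mq.MQException;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;

public class PersistentPut {
    public static void main(String[] args) throws MQException, java.io.IOException {
        // Connect to a local queue manager in bindings mode (illustrative name).
        MQQueueManager queueManager = new MQQueueManager("QM1");
        MQQueue queue = queueManager.accessQueue("BILLING.REQUEST", MQC.MQOO_OUTPUT);

        MQMessage message = new MQMessage();
        // Persistent messages are written to the queue manager logs, so they
        // survive a queue manager failure and are restored during restart.
        message.persistence = MQC.MQPER_PERSISTENT;
        message.writeString("order=1234,amount=99.99");

        queue.put(message, new MQPutMessageOptions());

        queue.close();
        queueManager.disconnect();
    }
}

Non-persistent messages skip the log and are therefore discarded when the queue manager restarts, a point that becomes important in the failover discussions later in this paper.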

WebSphere MQ also provides a mechanism for improving the availability of new messages by routing messages around a failed queue manager transparently to the application producing the messages. This is called WebSphere MQ clustering and is covered in WebSphere MQ queue manager clusters. Finally, on z/OS, WebSphere MQ supports shared message queues that are accessible to a number of queue managers. Failure of one queue manager still allows the messages to be accessed by other queue managers. These are covered in z/OS high availability options.

3.2. Standby machine - shared disks

As described above, when a queue manager fails, a restart is required to make the private message queues available again. Until then, the messages stored on the queue manager will be stranded. Therefore, you cannot access them until the machine and queue manager are returned to normal operation. To avoid the stranded messages problem, stored messages need to be made accessible, even if the hosting queue manager or machine is inoperable. In the standby machine solution, a second machine is used to host a second queue manager that is activated when the original machine or queue manager fails. The standby machine needs to be an exact replica, at any given point in time, of the master machine, so that when failure occurs, the standby machine can start the queue manager correctly. That is, the WebSphere MQ code on the standby machine should be at the same level, and the standby machine should have the same security privileges as the primary machine. A common method for implementing the standby machine approach is to store the queue manager data files and logs on an external disk system that is accessible to both the master and standby machines. WebSphere MQ writes its data synchronously to disk, which means a shared disk will always contain the most recent data for the queue manager. Therefore, if the primary machine fails, the secondary machine can start the queue manager and resume its last known good state.

Figure 2. An active-standby setup

The standby machine is ready to read the queue manager data and logs from the shared disk and to assume the IP address of the primary machine [3]. A shared external disk device is used to provide a resilient store for queue data and queue manager logs so that replication of messages is avoided. This preserves the once and once only delivery characteristic of persistent messages. If the data were replicated to a different system, the messages stored on the queues would have been duplicated to the other system, and once and once only delivery could not be guaranteed. For instance, if data was replicated to a standby server, and the connection between the two servers fails, the standby assumes that the master has failed, takes over the master server's role, and starts processing messages. However, as the master is still operational, messages are processed twice, hence duplicated messages occur. This is avoided when using a shared hard disk because the data only exists in one physical location and concurrent access is not allowed. The external disk used to store queue manager data should also be RAID enabled (a RAID configuration, such as mirroring, protects against data loss) to prevent it being a single point of failure (SPOF) [8]. The disk device may also have multiple disk controllers and multiple physical connections to each of the machines, to provide redundant access channels to the data. In normal operation, the shared disk is mounted by the master machine, which uses the storage to run the queue manager in the same way as if it were a local disk, storing both the queues and the WebSphere MQ log files on it.

The standby machine cannot mount the shared disk and therefore cannot start the queue manager, because the queue manager data is not accessible. When a failure is detected, the standby machine automatically takes on the master machine's role and, as part of that process, mounts the shared disk and starts the queue manager. The standby queue manager replays the logs stored on the shared disk to return the queue manager to the correct state, and resumes normal operations. Note that messages on queues that are failed over to another queue manager retain their order on the queue. This failover operation can also be performed without the intervention of a server administrator. It does require external software, known as HA clustering, to detect the failure and initiate the failover process. Only one machine has access to the shared disk partition at a time (a more accurate name would be switchable disks), and only one instance of the queue manager runs at any one time, to protect the data integrity of messages. The objective of the shared disk is to move the storage of important data (for example, queue data and queue manager logs) to a location external to the machine, so that when the master machine fails, another machine may use the data.

HA clustering software

Much of the functionality in the standby machine configuration is provided by external software, often termed HA clustering software [4]. This software addresses high availability issues using a more holistic approach than single applications, such as WebSphere MQ, can provide. It also recognizes that a business application may consist of many software packages and other resources, all of which need to be highly available. This is because another complication is introduced when a solution consists of several applications that have a dependency on each other. For example, an application may need access to both WebSphere MQ and a database, and may need to run on the same physical machine as these services. HA clustering provides the concept of resource groups, where applications are grouped together. When failure occurs in one of the applications in the group, the entire group is moved to a standby server, satisfying the dependency of the applications. However, this only occurs if the HA clustering software fails to restart the application on its current machine. It is also possible to move the network address and any other operating system resources with the group so that the failover is transparent to the client. If an individual software package were responsible for its own availability, it may not be able to transfer to another physical machine and will not be able to move any other resources on which it is dependent. By using HA clustering to cope with these low-level considerations, such as network address takeover, disk access, and application dependencies, the higher-level applications are relieved of this complexity. Although there are several vendors providing HA clustering, each package tends to follow the same basic principles and provide a similar set of basic functionality. Some solutions, such as Veritas Cluster Server and SteelEye LifeKeeper, are also compatible with multiple platforms to provide a similar solution in heterogeneous environments. In the same way that WebSphere MQ removed the complexity of application connectivity from the programmer, HA clustering techniques help provide a simple, generic solution for HA.

This means applications, such as messaging and data management, can focus on their core competencies, leaving HA clustering to provide a more reliable availability solution than resource-specific monitors. HA clustering also covers both hardware and software resources, and is a proven, recognized technology used in many other HA situations. HA clustering products are designed to be scalable and extensible to cope with changing requirements. IBM's AIX HACMP product, SteelEye LifeKeeper, and Veritas Cluster Server scale up to 32 servers. HACMP, LifeKeeper, and Cluster Server have extensions available to allow replication of disks to a remote site for disaster recovery purposes.

When to use standby machine - shared disks

The standby machine solution is ideal for messages that are delivered once and only once. For example, in billing and ordering systems, it is essential that messages are not duplicated so that customers are not billed twice, or sent two shipments instead of one. As HA clustering software is a separate product that sits alongside existing applications, this methodology is also suited to converting an existing server, or set of servers, to be highly available, and the conversion can be performed gradually. In large installations where there are many servers, HA clustering is a cost-effective choice through the use of an n+1 configuration. In this approach, a single machine is used as a backup for a number of live servers. Hardware redundancy is reduced and therefore cost is reduced, as only one extra machine is required to provide high availability to a number of active servers. As already shown, HA clustering software is capable of converting an existing application and its dependent resources to be highly available. It is, therefore, suited to situations where there are several applications or services that need to be made highly available. If those applications are dependent on each other, and rely on operating system resources such as network addresses to function correctly, HA clustering is ideally suited.

When not to use standby machine - shared disks

HA clustering is not always necessary when considering an HA solution. Although the examples given below are served by an HA clustering method, other solutions would serve just as well, and it would be possible to adopt HA clustering at a later date if required. If the trapped messages problem is not applicable, for example if there is no need to restart a failed queue manager with its messages intact, then shared disks are not necessary. This occurs if the system is only used for event messages that will be re-transmitted regularly, for messages that expire in a relatively short time, or for non-persistent messages (where an application is not relying on WebSphere MQ for assured delivery). For these situations, you can make a system highly available by using WebSphere MQ queue manager clustering only. This technology load balances messages and routes around failed servers. See WebSphere MQ queue manager clusters for more information on queue manager clusters.

In situations where it is not important to process the messages as soon as possible, HA clustering may provide too much availability at too great an expense. For example, if trapped messages can wait until an administrator restarts the machine, and hence the queue manager (using an internal RAID disk to protect the queue manager data), then HA clustering is too comprehensive a solution. In this situation, it is possible to allow access for new messages using WebSphere MQ queue manager clustering, as in the case above. The shared disk solution requires the machines to be physically close to each other, as the distance from the shared disk device needs to be small. This makes it unsuitable for use in a disaster recovery solution. However, some HA clustering software can provide disaster recovery functionality. For example, IBM's HACMP package has an extension called HAGEO, which provides data replication to remote sites. By backing up data in this fashion, it is possible to retrieve it if a site-wide failure occurs. However, the off-site data may not be the most up-to-date because the replication is often delayed by a few minutes. This is because instantaneous replication of data to an off-site location incurs a significant performance hit. Therefore, the more important the data, the smaller the time interval will be, but the greater the performance impact. Time and performance must be traded against each other when implementing a disaster recovery solution. Such solutions do not provide all of the benefits of the shared disk solution and are beyond the scope of this document. The following sections describe two possible configurations for HA clustering. These are termed active-active and active-standby configurations.

HA clustering active-standby configuration

In a generic HA clustering solution, when two machines are used in an active-standby configuration, one machine is running the applications in a resource group and the other is idle. In addition to network connections to the LAN, the machines also have a private connection to each other. This is either in the form of a serial link or a private Ethernet link. The private link provides a redundant connection between the machines for the purpose of detecting a complete failure. As previously mentioned, if a link between the machines fails, then both machines may try to become active. Therefore, the redundant link reduces the risk of communication failure between the two. The machines may also have two external links to the LAN. Again, this reduces the risk of external connectivity failure, but also allows the machines to have their own network address. One of the adapters is used for the service network address, that is, the network address that clients use to connect to the service, and the other adapter has a network address associated with the physical machine. The service address is moved between the machines upon failure to provide HA transparency to any clients. The standby machine monitors the master machine via the use of heartbeats. These are periodic checks by the standby machine to ensure that the master machine is still responding to requests. The master machine also monitors its disks and the processes running on it to ensure that no hardware failure has occurred. For each service running on the machine, a custom utility is required to inform the HA clustering software that it is still running.
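A minimal sketch of such a check is shown below, written in Java for consistency with the other examples in this paper (real monitor utilities, such as those shipped with the SupportPacs, are typically shell scripts). It invokes the dspmq control command and looks for a running status; the queue manager name and the exit-code convention are illustrative and would need to match what your HA clustering software expects.

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class QueueManagerCheck {
    // Returns true if dspmq reports the named queue manager as running.
    static boolean isRunning(String queueManagerName) throws Exception {
        Process process = new ProcessBuilder("dspmq", "-m", queueManagerName).start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("(Running)")) {
                    return true;
                }
            }
        }
        process.waitFor();
        return false;
    }

    public static void main(String[] args) throws Exception {
        // HA clustering monitors usually signal health through an exit code:
        // 0 means the resource is healthy, non-zero means it has failed.
        System.exit(isRunning("QM1") ? 0 : 1);
    }
}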
In the case of WebSphere MQ, the SupportPacs describing HA configurations provide utilities to check the operation of queue managers, which can easily be adapted for other HA systems. Details of these SupportPacs are listed in Appendix A.

A small amount of configuration is required for each resource group to describe what should happen at start-up and shutdown, although in most cases this is simple. In the case of WebSphere MQ, this could be a start-up script containing commands to start the queue manager (for example, strmqm), the listener (for example, runmqlsr), or any other queue manager programs. A corresponding shutdown script is also needed, and depending on the HA clustering package in use, a number of other scripts may be required. Samples for WebSphere MQ are provided with the SupportPacs described in Appendix A. As the heartbeat mechanism is the primary method of failure detection, if a heartbeat does not receive a response, the standby machine assumes that the master server has failed. However, heartbeats may go unanswered for a number of reasons, such as an overloaded server or a communication failure. There is a possibility that the master server will resume processing at a later stage, or is still running. This can lead to duplicate messages in the system and is not desired. Managing this problem is also the role of the HA clustering package. For example, Red Hat Cluster services and IBM's HACMP work around this problem by having a watchdog timer with a lower timeout than the cluster. This ensures that the machine reboots itself before another machine in the cluster takes over its role. Programmable power supplies are also supported, so other machines in the cluster can power cycle the affected machine, to ensure that it is no longer operational before starting the resource group. Essentially, the machines in the cluster have the capability to turn the other machines off. Some HA clustering software suites also provide the capability to detect other types of failure, such as system resource exhaustion or process failure, and try to recover from these failures locally. For WebSphere MQ on AIX, you can use the appropriate SupportPac (see Appendix A) to locally restart a queue manager that is not responding. This can avoid the more time-consuming operation of completely moving the resource group to another server. You should design the machines used in HA clustering to have identical configurations to each other. This includes installed software levels, security configurations, and performance capabilities, to minimize the possibility of resource group start-up failure. This ensures that machines in the network all have the capability to take on another machine's role. Note that for active-standby configurations, only one instance of an application is running at any one moment and therefore, software vendors may only charge for one instance of the application, as is the case for WebSphere MQ.

HA clustering active-active configuration

It is also possible to run services on the redundant machine in what is termed an active-active configuration. In this mode, the servers are both actively running programs and acting as backups for each other. If one server fails, the other continues to run its own services, as well as the failed server's.

This enables the backup server to be used more effectively, although when a failure does occur, the performance of the system is reduced because it has taken on extra processing. In Figure 3, the second active machine runs both queue managers if a failure occurs.

Figure 3. An active-active configuration

In larger installations, where several resource groups exist and more than one server needs to be made highly available, it is possible to use one backup machine to cover several active servers. This setup is known as an n+1 configuration, and has the benefit of reduced redundant hardware costs, because the servers do not each have a dedicated backup machine. However, if several servers fail at the same time, the backup machine may become overloaded. These extra costs must be weighed against the potential cost of more than one server failing, and more than one backup machine being required.

HA clustering benefits

HA clustering software provides the capability to perform controlled failover of resource groups. This allows administrators to test the functionality of a configured system, and also allows machines to be gracefully removed from an active cluster. This can be for maintenance purposes, such as hardware and software upgrades or data backup. It also allows failed servers, once repaired, to be placed back in the cluster and to resume their services. This is known as fail-back [4]. A controlled failover operation also results in less downtime because the cluster does not need to detect the failure.

There is no need to wait for the cluster timeout. Also, as the applications, such as WebSphere MQ, are stopped in a controlled manner, the start-up time is reduced because there is no need for log replay. Using abstract resource groups makes it possible for a service to remain highly available even when the machine that normally runs the services has been removed from the cluster. This is only true as long as the other machines have comparable software installed and access to the same data, meaning any machine can run the resource group. The modular nature of resource groups also helps the gradual uptake of HA clustering in an existing system and easily allows services to be added at a later date. This also means that in a large queue manager installation, you can convert mission-critical queue managers to be highly available first, and later convert the less critical queue managers, or not at all. Many of the requirements for implementing HA clustering are also desirable in more bespoke, or product-centric, HA solutions. For example, RAID disk arrays [8], extra network connections, and redundant power supplies all protect against hardware failure. Therefore, improving the availability of a server results in additional cost, whether a bespoke or HA clustering technique is used. HA clustering may require additional hardware over and above some application-specific HA solutions, but this enables an HA clustering approach to provide a more complete HA solution. You can easily extend the configuration of HA clustering to cover other applications running on the machine. The availability of all services is provided via a standard methodology and presented through a consistent interface rather than being implemented separately by each service on the machine. This in turn reduces complexity and staff training times, and reduces errors being introduced during administration activities. By using one product to provide an availability solution, you can take a common approach to decision making. For instance, if a number of the servers in a cluster are separated from the others by network failure, a unanimous decision is needed to decide which servers should remain active in the cluster. If there were several HA solutions in place (that is, each product using its own availability solution), each with separate quorum algorithms (a quorum is the minimum number of members of a deliberative body necessary to conduct the business of that group), then it is possible that each algorithm would have a different outcome. This could result in an invalid selection of active servers in the cluster that may not be able to communicate. By having a separate entity, in the form of the HA clustering software, decide which part of the cluster has the quorum, only one outcome is possible, and the cluster of servers continues to be available.

Summary

The shared disk solution described above is a robust approach to the problem of trapped messages, and allows access to stored messages in the event of a failure. However, there will be a short period of time where there is no access to the queue manager while the failure is being detected and the service is being transferred to the standby server. It is possible during this time to use WebSphere MQ clustering to provide access for new messages, because its load balancing capabilities will route messages around the failed queue manager to another queue manager in the cluster.

How to use HA clustering with WebSphere MQ clustering is described in When to use HA WebSphere MQ queue manager clusters.

3.3. z/OS high availability options

z/OS provides a facility for operating system restart of failed queue managers called the Automatic Restart Manager (ARM). It provides a mechanism, via ARM policies, for a failed queue manager to be restarted in place on the failing logical partition (LPAR) or, in the case of an LPAR failure, started on a different LPAR along with other subsystems and applications grouped together, such that the subsystem components that provide the overall business solution can be restarted together. In addition, with a parallel sysplex, Geographically Dispersed Parallel Sysplex (GDPS) provides the ability for automatic restart of subsystems, via remote DASD copying techniques, in the event of a site failure. The above techniques are restart techniques that are similar to those discussed earlier for distributed platforms. We will now look at a capability which maximizes the availability of message queues in the event of queue manager failures and that does not require queue manager restart.

Shared queues (z/OS only)

WebSphere MQ shared queues are an exploitation of the z/OS-unique Coupling Facility (CF) technology that provides high-speed access to data across a sysplex via a rich set of facilities to store and retrieve data. WebSphere MQ stores shared message queues in the Coupling Facility, and this, in turn, means that unlike private message queues, they are not owned by any single queue manager. Queue managers are grouped into Queue Sharing Groups (QSGs), analogous to Data Sharing Groups with data-sharing DB2. All queue managers within a QSG can access shared message queues for putting and getting of messages via the WebSphere MQ API. This enables multiple putters and getters on the same shared queue from within the QSG. Also, WebSphere MQ provides peer recovery such that in-flight shared queue messages are automatically rolled back by another member of the QSG in the event of a queue manager failure. WebSphere MQ still uses its logs for capturing persistent message updates so that, in the extremely unlikely event of a CF failure, you can use the normal restart procedures to restore messages. In addition, z/OS provides system facilities to automatically duplex the CF structures used by WebSphere MQ. The combination of these facilities provides WebSphere MQ shared message queues with extremely high availability characteristics. Figure 4 shows three queue managers, QM1, QM2, and QM3, in the QSG GRP1 sharing access to queue A in the Coupling Facility. This setup allows all three queue managers to process messages arriving on queue A.

Figure 4. Three queue managers in a QSG share queue A on a Coupling Facility

A further benefit of using shared queues is the ability to use shared channels. You can use shared channels in two different scenarios to further extend the high availability of WebSphere MQ. First, using shared channels, an external queue manager can connect to a specific queue manager in the QSG using channels. It can then put messages to the shared queue via this queue manager. This allows queue managers in a distributed environment to utilize the HA functionality provided by shared queues. Therefore, the target application of messages put by the queue manager can be any of those running on a queue manager in the QSG. Second, you can use a generic port so that a channel connecting to the QSG could be connected to any queue manager in the QSG. If the channel loses its connection (because of a queue manager failure), then it is possible for the channel to connect to another queue manager in the QSG by simply reconnecting to the same generic port.

Benefits of shared message queues

The main benefit of a shared queue is its high availability. There are numerous customer-selectable configuration options for CF storage, ranging from running on standalone processors with their own power supplies to the Internal Coupling Facility (ICF) that runs on spare processors within a general zSeries server. Another key factor is that the Coupling Facility Control Code (CFCC) runs in its own LPAR, where it is isolated from any application or subsystem code. In addition, a shared queue naturally balances the workload between the queue managers in the QSG. That is, a queue manager will only request a message from the shared queue when the application which is processing messages is free to do so. Therefore, the availability of the messaging service is improved because queue managers are not flooded by messages directly. Instead, they consume messages from the shared queue when they are ready to do so. Also, should greater message processing performance be required, you can add extra queue managers to the QSG to process more incoming messages. With persistent messages, both private and shared, the message processing limit is constrained by the speed of the log. With shared message queues, each queue manager uses its own log for updates.

Therefore, deploying additional queue managers to process a shared queue means the total logging cost is spread across a number of queue managers. This provides a highly scalable solution. Conversely, if a queue manager requires maintenance, you can remove it from the QSG, leaving the remaining queue managers to continue processing the messages. Both the addition and removal of queue managers in a QSG can be performed without disrupting the existing members. Lastly, should a queue manager fail during the processing of a unit of work, the other members of the QSG will spot this and Peer Recovery is initiated. That is, if the unit of work was not completed by the failed queue manager, another queue manager in the QSG will complete the processing. This arbitration of queue manager data is achieved via hardware and microcode on z/OS. This means that the availability of the system is increased, as the failure of any one queue manager does not result in trapped messages or inconsistent transactions. This is because Peer Recovery either completes the transaction or rolls it back. For more information on Peer Recovery and how to configure it, see the z/OS Systems Administration Guide [6]. The benefits of shared queues are not solely limited to z/OS queue managers. Although you cannot set up shared queues in a distributed environment, it is possible for distributed queue managers to place messages onto them through a member of the QSG. This allows the QSG to process a distributed application's messages in a z/OS HA environment.

Limitations of shared message queues

With WebSphere MQ V5.3, physical shared messages are limited to less than 63KB in size. Any application that attempts to put a message greater than this limit receives an error on the MQPUT call. However, you can use the message grouping API to construct a logical message greater than 63KB, which consists of a number of physical segments. The Coupling Facility is a resilient and durable piece of hardware, but it is a single point of failure in this high availability configuration. However, z/OS provides duplexing facilities, where updates to one CF structure are automatically propagated to a second CF. In the unlikely event of failure of the primary CF, z/OS automatically switches access to the secondary, while the primary is being rebuilt. This system-managed duplexing is supported by WebSphere MQ. While the rebuild is taking place, there is no noticeable application effect. However, this duplexing will clearly have an effect on overall performance. Finally, a queue manager can only belong to one QSG and all queue managers in a QSG must be in the same sysplex. This is a small limitation on the flexibility of QSGs. Also, a QSG can only contain a maximum of 32 queue managers. For more information on shared queues, see the WebSphere MQ for z/OS Concepts and Planning Guide [1].

3.4. WebSphere MQ queue manager clusters

A WebSphere MQ queue manager cluster is a cross-platform workload balancing solution that allows WebSphere MQ messages to be routed around a failed queue manager. It allows a queue to be hosted across multiple queue managers, thus allowing an application to be duplicated across multiple machines. It provides a highly available messaging service, allowing incoming messages to be forwarded to any queue manager in the cluster for application processing. Therefore, if any queue manager in the cluster fails, new incoming messages continue to be processed by the remaining queue managers. In Figure 5, an application puts a message to a cluster queue on QM2. This cluster queue is defined locally on QM1, QM4, and QM5. Therefore, one of these queue managers will receive the message and process it.

Figure 5. Queue managers 1, 4, and 5 in the cluster receive messages in order

By balancing the workload between QM1, QM4, and QM5, an application is distributed across multiple queue managers, making it highly available. If a queue manager fails, the incoming messages are balanced among the remaining queue managers. While WebSphere MQ clustering provides continuous messaging for new messages, it is not a complete HA solution because it is unable to handle messages that have already been delivered to a queue manager for processing. As we have seen above, if a queue manager fails, these trapped private messages are only processed when the queue manager is restarted. However, by combining WebSphere MQ clustering with the recovery techniques covered above, you can create an HA solution for both new and existing messages. The following section shows this in action in a distributed shared disk environment.
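From the application's point of view, putting to a cluster queue looks the same as putting to a local queue; the difference is in how the queue is opened. The hedged sketch below, again using the WebSphere MQ base Java classes with illustrative queue manager and queue names, opens a cluster queue with the MQOO_BIND_NOT_FIXED option so that the cluster workload algorithm can choose a destination for each message and route around a failed queue manager. MQOO_BIND_ON_OPEN, by contrast, fixes all messages to one instance and is only needed when messages have affinities, a trade-off discussed later in this section.

import com.ibm.mq.MQC;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;

public class ClusterQueuePut {
    public static void main(String[] args) throws Exception {
        // The application is connected to QM2; the cluster queue is hosted
        // on QM1, QM4, and QM5 elsewhere in the cluster (as in Figure 5).
        MQQueueManager queueManager = new MQQueueManager("QM2");

        // MQOO_BIND_NOT_FIXED lets each message be routed to any available
        // instance of the cluster queue, so a failed hosting queue manager
        // is simply bypassed for new messages.
        int openOptions = MQC.MQOO_OUTPUT | MQC.MQOO_BIND_NOT_FIXED;
        MQQueue clusterQueue = queueManager.accessQueue("CLUSTER.REQUEST", openOptions);

        MQMessage message = new MQMessage();
        message.persistence = MQC.MQPER_PERSISTENT;
        message.writeString("request payload");
        clusterQueue.put(message, new MQPutMessageOptions());

        clusterQueue.close();
        queueManager.disconnect();
    }
}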

Extending the standby machine - shared disk approach

By hosting cluster queue managers on active-standby or active-active setups, trapped messages, on private or cluster queues, are made available when the queue manager is failed over to a standby machine and restarted. The queue manager will be failed over and will begin processing messages within minutes, instead of the longer amount of time it would take to manually recover and repair the failed machine or failed queue manager in the cluster. The added benefit of combining queue manager clusters with HA clustering is that the high availability nature of the system becomes transparent to any clients using it. This is because they are putting messages to a single cluster queue. If a queue manager in the cluster fails, the client's outstanding requests are processed when the queue manager is failed over to a backup machine. In the meantime, the client needs to take no action, because its new requests will be routed around the failure and processed by another queue manager in the cluster. The client must only tolerate its requests taking slightly longer than normal to be returned in the event of a failover. Figure 6 shows each queue manager in the cluster in an active-active, standby machine - shared disk configuration. The machines are configured with separate shared disks for queue manager data and logs to decrease the time required to restart the queue manager. See Considerations for WebSphere MQ restart performance for more information.

Figure 6. Queue managers 1, 4, and 5 have active standby machines

In this example, if queue manager 4 fails, it fails over to the same machine as queue manager 3, where both queue managers will run until the failed machine is repaired.

When to use HA WebSphere MQ queue manager clusters

Because this solution is implemented by combining external HA clustering technology with WebSphere MQ queue manager clusters, it provides the ultimate high availability configuration for distributed WebSphere MQ. It makes both incoming and queued messages available and also fails over not only a queue manager, but also any other resources running on the machine. For instance, server applications, databases, or user data can fail over to a standby machine along with the queue manager. When using HA WebSphere MQ clustering in an active-standby configuration, it is a simpler task to apply maintenance or software updates to machines, queue managers, or applications. This is because you can first update a standby machine, then a queue manager can fail over to it, ensuring that the update works correctly. If it is successful, you can update the primary machine and then the queue manager can fail back onto it. HA WebSphere MQ queue manager clusters also greatly reduce the administration of the queue managers within them, which in turn reduces the risk of administration errors. Queue managers that are defined in a cluster do not require channel or queue definitions set up for every other member of the cluster. Instead, the cluster handles these communications and propagates relevant information to each member of the cluster through a repository. HA WebSphere MQ queue manager clusters are able to scale applications linearly because you can add new queue managers to the cluster to aid in the processing of incoming messages. Conversely, you can remove queue managers from the cluster for maintenance and the cluster can still continue to process incoming requests. If the queue manager's presence in the cluster is required, but the hardware must be maintained, then you can use this technique in conjunction with failing the queue manager over to a standby machine. This frees the machine, but keeps the queue manager running. It is also possible for administrators to write their own cluster workload exits. This allows finer control of how messages are delivered to queue managers in the cluster. Therefore, you can target messages at machines in different ratios based on the performance capabilities of the machine (rather than in a simple round-robin fashion).

When not to use HA WebSphere MQ queue manager clusters

HA WebSphere MQ queue manager clusters require additional proprietary HA hardware (shared disks) and external HA clustering software (such as HACMP). This increases the administration costs of the environment because you also need to administer the HA components. This approach also increases the initial implementation costs because extra hardware and software are required. Therefore, balance these initial costs against the potential costs incurred if a queue manager fails and messages become trapped. Note that non-persistent messages do not survive a queue manager failover. This is because the queue manager restarts once it has been failed over to the standby machine, causing it to process its logs and return to its most recent known state. At this point, non-persistent messages are discarded.

Therefore, if your application uses non-persistent messages, take this factor into account. If trapped messages are not a problem for the applications (for example, the response time of the application is irrelevant or the data is updated frequently), then HA WebSphere MQ queue manager clusters are probably not required. That is, if the amount of time required to repair a machine and restart its queue manager is acceptable, then having a standby machine to take over the queue manager is not necessary. In this case, it is possible to implement WebSphere MQ queue manager clusters without any additional HA hardware or software.

Considerations for implementation of HA WebSphere MQ queue manager clusters

When configuring an active-active or active-standby setup in a cluster, administrators should test to ensure that the failover of a given node works correctly. Nodes should be failed over, when and where possible, to backup machines to ensure the failover processes work as designed and that no problems are encountered when a failover is actually required. Perform this procedure at the discretion of the administrators; if failover does not happen smoothly, it may cause problems or outages in a future production environment. As with queue manager clusters, do not code WebSphere MQ applications to be machine- or queue manager-specific, for example by relying on resources only available to a single machine. This is because when applications are failed over to a standby machine, along with the queue manager they are running on, they may not have access to these resources. To avoid these administrative problems, machines should be as equal as possible with respect to software levels, operating system environments, and security settings. Therefore, any failed-over applications should have no problems running. Avoid message affinities when programming applications. This is because there is no guarantee that messages put to the cluster queue will be processed by the same queue manager every time. It is possible to use the MQ open option BIND_ON_OPEN to ensure an application's messages are always delivered to the same queue manager in the cluster. However, an application performing this operation incurs reduced availability because this queue manager may fail during message processing. In this case, the application must wait until the queue manager is failed over to a backup machine before it can begin processing the application's requests. If affinities had not been used, then no delay in message processing would be experienced; another queue manager in the cluster would continue processing any new requests. Application programmers should avoid long-running transactions in their applications, because these will greatly increase the restart time of the queue manager when it is failed over to a standby machine. See Considerations for WebSphere MQ restart performance for more information. When implementing a WebSphere MQ cluster solution, whether for an HA configuration or for normal workload balancing, be careful to have at least two full cluster repositories defined. These repositories should be on machines that are highly available.

For example, they should have redundant power supplies, network access, and hard disks, and should not be heavily loaded with work. Repositories are vital to the cluster because they contain cluster-wide information that is distributed to each cluster member. If both of these repositories are lost, it is impossible for the cluster to propagate any cluster changes, such as new queues or queue managers. However, the cluster continues to function with each member's partial repositories until the full repositories are restored.

3.5. HA capable client applications

You can achieve high availability on the client side rather than using the HA clustering, HA WebSphere MQ queue manager cluster, or shared queue server-side techniques previously described. HA capable clients are an inexpensive way to implement high availability, but they usually result in a large client with complex logic. This is not ideal, and a server-side approach is recommended. However, HA capable clients are discussed here for completeness. Most occurrences of a queue manager failure result in a connection failure with the client. Even if the queue manager is returned to normal operation, the client is disconnected and remains so until the code used to connect the client to the queue manager is executed again. One possible solution to the problem of a server failure is to design the client applications to reconnect, or to connect to a different, but functionally identical, server. The client's application logic has to detect a failed connection and reconnect to another specified server. The method of detecting and handling a failed connection depends on the MQ API in use. MQ JMS, for instance, provides an exception listener mechanism that allows the programmer to specify code to be run upon a failure event. The programmer can also use Java try/catch blocks to allow failures to be handled during code execution. The MQI API reports a failure upon the next function call that requires communication with the queue manager. In this scenario, it is the programmer's responsibility to resolve the failure. The management of the failure depends on the type of application and also on whether there are any other high availability solutions in place. A simple reconnect to the same queue manager may be attempted, and if successful, the application can resume processing. You can configure the application with a list of queue managers that it may connect to. Upon failure, it can reconnect to the next queue manager in the list. In an HA clustering solution, clients still experience a failed connection if a server is failed over to a different physical machine. This is because it is not possible to move open network connections between servers. The client also may need to be configured to perform several reconnect attempts to the server, and/or wait a period of time to allow the server to restart. If the application is transactional, and the connection fails mid-transaction, the entire transaction needs to be re-executed when a new connection is established. This is because WebSphere MQ queue managers will roll back any uncommitted work at start-up time. You can supplement many server-side HA solutions with client-side application code designed to cope with the temporary loss of, or the need to reconnect to, a queue manager. A client that contains no extra code may need user intervention, or even need to be completely restarted, to resume full functionality. There is obviously extra effort required to code the client application to be HA aware, but the end result is a more autonomous client.
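The sketch below illustrates this reconnection pattern using the MQ JMS classes. The host names, port, and queue manager names are illustrative, and the exact factory class and transport constant may vary between WebSphere MQ releases, so treat it as an outline rather than a definitive implementation: the client walks a list of functionally identical queue managers, and an exception listener re-drives the connect logic when the current connection breaks.

import javax.jms.ExceptionListener;
import javax.jms.JMSException;
import javax.jms.QueueConnection;
import com.ibm.mq.jms.JMSC;
import com.ibm.mq.jms.MQQueueConnectionFactory;

public class ReconnectingClient {
    // Functionally identical queue managers the client may connect to
    // (host, port, and queue manager name are illustrative).
    private static final String[][] ENDPOINTS = {
        {"hosta.example.com", "1414", "QM1"},
        {"hostb.example.com", "1414", "QM4"},
    };

    private volatile QueueConnection connection;

    public void connect() {
        for (String[] endpoint : ENDPOINTS) {
            try {
                MQQueueConnectionFactory factory = new MQQueueConnectionFactory();
                factory.setTransportType(JMSC.MQJMS_TP_CLIENT_MQ_TCPIP);
                factory.setHostName(endpoint[0]);
                factory.setPort(Integer.parseInt(endpoint[1]));
                factory.setQueueManager(endpoint[2]);

                QueueConnection candidate = factory.createQueueConnection();
                // If the connection later breaks (for example, the queue
                // manager fails or is failed over), try the list again.
                candidate.setExceptionListener(new ExceptionListener() {
                    public void onException(JMSException broken) {
                        connect();
                    }
                });
                candidate.start();
                connection = candidate;
                return; // connected successfully
            } catch (JMSException e) {
                // This queue manager is unavailable; try the next one.
            }
        }
        throw new IllegalStateException("No queue manager in the list is available");
    }
}

In a transactional client, the exception listener (or the catch block around each operation) would also need to re-drive any uncommitted unit of work after reconnecting, as described above.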


More information

Veritas Storage Foundation for Oracle RAC from Symantec

Veritas Storage Foundation for Oracle RAC from Symantec Veritas Storage Foundation for Oracle RAC from Symantec Manageability, performance and availability for Oracle RAC databases Data Sheet: Storage Management Overviewview offers a proven solution to help

More information

Data Sheet: High Availability Veritas Cluster Server from Symantec Reduce Application Downtime

Data Sheet: High Availability Veritas Cluster Server from Symantec Reduce Application Downtime Reduce Application Downtime Overview is an industry-leading high availability solution for reducing both planned and unplanned downtime. By monitoring the status of applications and automatically moving

More information

VERITAS Volume Replicator Successful Replication and Disaster Recovery

VERITAS Volume Replicator Successful Replication and Disaster Recovery VERITAS Replicator Successful Replication and Disaster Recovery Introduction Companies today rely to an unprecedented extent on online, frequently accessed, constantly changing data to run their businesses.

More information

EMC VPLEX Geo with Quantum StorNext

EMC VPLEX Geo with Quantum StorNext White Paper Application Enabled Collaboration Abstract The EMC VPLEX Geo storage federation solution, together with Quantum StorNext file system, enables a global clustered File System solution where remote

More information

Broker Clusters. Cluster Models

Broker Clusters. Cluster Models 4 CHAPTER 4 Broker Clusters Cluster Models Message Queue supports the use of broker clusters: groups of brokers working together to provide message delivery services to clients. Clusters enable a Message

More information

ForeScout CounterACT. Resiliency Solutions. CounterACT Version 8.0

ForeScout CounterACT. Resiliency Solutions. CounterACT Version 8.0 ForeScout CounterACT Resiliency Solutions CounterACT Version 8.0 Table of Contents About ForeScout Resiliency Solutions... 4 Comparison of Resiliency Solutions for Appliances... 5 Choosing the Right Solution

More information

ForeScout CounterACT Resiliency Solutions

ForeScout CounterACT Resiliency Solutions ForeScout CounterACT Resiliency Solutions User Guide CounterACT Version 7.0.0 About CounterACT Resiliency Solutions Table of Contents About CounterACT Resiliency Solutions... 5 Comparison of Resiliency

More information

EMC VPLEX with Quantum Stornext

EMC VPLEX with Quantum Stornext White Paper Application Enabled Collaboration Abstract The EMC VPLEX storage federation solution together with Quantum StorNext file system enables a stretched cluster solution where hosts has simultaneous

More information

Chapter 1 CONCEPTS AND FACILITIES. SYS-ED/ Computer Education Techniques, Inc.

Chapter 1 CONCEPTS AND FACILITIES. SYS-ED/ Computer Education Techniques, Inc. Chapter 1 CONCEPTS AND FACILITIES SYS-ED/ Computer Education Techniques, Inc. Objectives You will learn: Objects of MQ. Features and benefits. Purpose of utilities. Architecture of the MQ system. Queue

More information

Oracle E-Business Availability Options. Solution Series for Oracle: 2 of 5

Oracle E-Business Availability Options. Solution Series for Oracle: 2 of 5 Oracle E-Business Availability Options Solution Series for Oracle: 2 of 5 Table of Contents Coping with E-Business Hours Oracle E-Business Availability Options.....1 Understanding Challenges to Availability...........................2

More information

Veritas Cluster Server from Symantec

Veritas Cluster Server from Symantec Delivers high availability and disaster recovery for your critical applications Data Sheet: High Availability Overviewview protects your most important applications from planned and unplanned downtime.

More information

Microsoft Office SharePoint Server 2007

Microsoft Office SharePoint Server 2007 Microsoft Office SharePoint Server 2007 Enabled by EMC Celerra Unified Storage and Microsoft Hyper-V Reference Architecture Copyright 2010 EMC Corporation. All rights reserved. Published May, 2010 EMC

More information

MQ High Availability and Disaster Recovery Implementation scenarios

MQ High Availability and Disaster Recovery Implementation scenarios MQ High Availability and Disaster Recovery Implementation scenarios Sandeep Chellingi Head of Hybrid Cloud Integration Prolifics Agenda MQ Availability Message Availability Service Availability HA vs DR

More information

Protecting Mission-Critical Application Environments The Top 5 Challenges and Solutions for Backup and Recovery

Protecting Mission-Critical Application Environments The Top 5 Challenges and Solutions for Backup and Recovery White Paper Business Continuity Protecting Mission-Critical Application Environments The Top 5 Challenges and Solutions for Backup and Recovery Table of Contents Executive Summary... 1 Key Facts About

More information

Step-by-Step Guide to Installing Cluster Service

Step-by-Step Guide to Installing Cluster Service Page 1 of 23 TechNet Home > Products & Technologies > Windows 2000 Server > Deploy > Configure Specific Features Step-by-Step Guide to Installing Cluster Service Topics on this Page Introduction Checklists

More information

IBM TS7700 Series Grid Failover Scenarios Version 1.4

IBM TS7700 Series Grid Failover Scenarios Version 1.4 July 2016 IBM TS7700 Series Grid Failover Scenarios Version 1.4 TS7700 Development Team Katsuyoshi Katori Kohichi Masuda Takeshi Nohta Tokyo Lab, Japan System and Technology Lab Copyright 2006, 2013-2016

More information

Step into the future. HP Storage Summit Converged storage for the next era of IT

Step into the future. HP Storage Summit Converged storage for the next era of IT HP Storage Summit 2013 Step into the future Converged storage for the next era of IT 1 HP Storage Summit 2013 Step into the future Converged storage for the next era of IT Karen van Warmerdam HP XP Product

More information

Datacenter replication solution with quasardb

Datacenter replication solution with quasardb Datacenter replication solution with quasardb Technical positioning paper April 2017 Release v1.3 www.quasardb.net Contact: sales@quasardb.net Quasardb A datacenter survival guide quasardb INTRODUCTION

More information

IBM GDPS V3.3: Improving disaster recovery capabilities to help ensure a highly available, resilient business environment

IBM GDPS V3.3: Improving disaster recovery capabilities to help ensure a highly available, resilient business environment Marketing Announcement February 14, 2006 IBM GDPS V3.3: Improving disaster recovery capabilities to help ensure a highly available, resilient business environment Overview GDPS is IBM s premier continuous

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance

More information

IBM Spectrum Protect Version Introduction to Data Protection Solutions IBM

IBM Spectrum Protect Version Introduction to Data Protection Solutions IBM IBM Spectrum Protect Version 8.1.2 Introduction to Data Protection Solutions IBM IBM Spectrum Protect Version 8.1.2 Introduction to Data Protection Solutions IBM Note: Before you use this information

More information

High Availability and Disaster Recovery Solutions for Perforce

High Availability and Disaster Recovery Solutions for Perforce High Availability and Disaster Recovery Solutions for Perforce This paper provides strategies for achieving high Perforce server availability and minimizing data loss in the event of a disaster. Perforce

More information

MQ Parallel Sysplex Exploitation, Getting the Best Availability from MQ on z/os by using Shared Queues

MQ Parallel Sysplex Exploitation, Getting the Best Availability from MQ on z/os by using Shared Queues MQ Parallel Sysplex Exploitation, Getting the Best Availability from MQ on z/os by using Shared Queues Dirk Marski dirk.marski@uk.ibm.com WebSphere MQ for z/os IBM Hursley March 13 th, 2014 Session 15015

More information

Equitrac Office and Express DCE High Availability White Paper

Equitrac Office and Express DCE High Availability White Paper Office and Express DCE High Availability White Paper 2 Summary............................................................... 3 Introduction............................................................

More information

High Availability Options for SAP Using IBM PowerHA SystemMirror for i

High Availability Options for SAP Using IBM PowerHA SystemMirror for i High Availability Options for SAP Using IBM PowerHA Mirror for i Lilo Bucknell Jenny Dervin Luis BL Gonzalez-Suarez Eric Kass June 12, 2012 High Availability Options for SAP Using IBM PowerHA Mirror for

More information

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware

More information

Overview. CPS Architecture Overview. Operations, Administration and Management (OAM) CPS Architecture Overview, page 1 Geographic Redundancy, page 5

Overview. CPS Architecture Overview. Operations, Administration and Management (OAM) CPS Architecture Overview, page 1 Geographic Redundancy, page 5 CPS Architecture, page 1 Geographic Redundancy, page 5 CPS Architecture The Cisco Policy Suite (CPS) solution utilizes a three-tier virtual architecture for scalability, system resilience, and robustness

More information

Avoiding the Cost of Confusion: SQL Server Failover Cluster Instances versus Basic Availability Group on Standard Edition

Avoiding the Cost of Confusion: SQL Server Failover Cluster Instances versus Basic Availability Group on Standard Edition One Stop Virtualization Shop Avoiding the Cost of Confusion: SQL Server Failover Cluster Instances versus Basic Availability Group on Standard Edition Written by Edwin M Sarmiento, a Microsoft Data Platform

More information

The Right Choice for DR: Data Guard, Stretch Clusters, or Remote Mirroring. Ashish Ray Group Product Manager Oracle Corporation

The Right Choice for DR: Data Guard, Stretch Clusters, or Remote Mirroring. Ashish Ray Group Product Manager Oracle Corporation The Right Choice for DR: Data Guard, Stretch Clusters, or Remote Mirroring Ashish Ray Group Product Manager Oracle Corporation Causes of Downtime Unplanned Downtime Planned Downtime System Failures Data

More information

Documentation Accessibility. Access to Oracle Support

Documentation Accessibility. Access to Oracle Support Oracle NoSQL Database Availability and Failover Release 18.3 E88250-04 October 2018 Documentation Accessibility For information about Oracle's commitment to accessibility, visit the Oracle Accessibility

More information

MarkLogic Server. Scalability, Availability, and Failover Guide. MarkLogic 9 May, Copyright 2018 MarkLogic Corporation. All rights reserved.

MarkLogic Server. Scalability, Availability, and Failover Guide. MarkLogic 9 May, Copyright 2018 MarkLogic Corporation. All rights reserved. Scalability, Availability, and Failover Guide 1 MarkLogic 9 May, 2017 Last Revised: 9.0-4, January, 2018 Copyright 2018 MarkLogic Corporation. All rights reserved. Table of Contents Table of Contents Scalability,

More information

Power Systems High Availability & Disaster Recovery

Power Systems High Availability & Disaster Recovery Power Systems High Availability & Disaster Recovery Solutions Comparison of various HA & DR solutions for Power Systems Authors: Carl Burnett, Joe Cropper, Ravi Shankar Table of Contents 1 Abstract...

More information

: Assessment: IBM WebSphere MQ V7.0, Solution Design

: Assessment: IBM WebSphere MQ V7.0, Solution Design Exam : A2180-376 Title : Assessment: IBM WebSphere MQ V7.0, Solution Design Version : Demo 1. Which new feature in WebSphere MQ V7.0 needs to be taken into account when WebSphere MQ solutions are deployed

More information

Presented By Chad Dimatulac Principal Database Architect United Airlines October 24, 2011

Presented By Chad Dimatulac Principal Database Architect United Airlines October 24, 2011 Presented By Chad Dimatulac Principal Database Architect United Airlines October 24, 2011 How much are the losses of a potential business when a downtime occurs during a planned maintenance and unexpected

More information

OL Connect Backup licenses

OL Connect Backup licenses OL Connect Backup licenses Contents 2 Introduction 3 What you need to know about application downtime 5 What are my options? 5 Reinstall, reactivate, and rebuild 5 Create a Virtual Machine 5 Run two servers

More information

Replication is the process of creating an

Replication is the process of creating an Chapter 13 Local tion tion is the process of creating an exact copy of data. Creating one or more replicas of the production data is one of the ways to provide Business Continuity (BC). These replicas

More information

BUSINESS CONTINUITY: THE PROFIT SCENARIO

BUSINESS CONTINUITY: THE PROFIT SCENARIO WHITE PAPER BUSINESS CONTINUITY: THE PROFIT SCENARIO THE BENEFITS OF A COMPREHENSIVE BUSINESS CONTINUITY STRATEGY FOR INCREASED OPPORTUNITY Organizational data is the DNA of a business it makes your operation

More information

High Availability and Disaster Recovery features in Microsoft Exchange Server 2007 SP1

High Availability and Disaster Recovery features in Microsoft Exchange Server 2007 SP1 High Availability and Disaster Recovery features in Microsoft Exchange Server 2007 SP1 Product Group - Enterprise Dell White Paper By Farrukh Noman Ananda Sankaran April 2008 Contents Introduction... 3

More information

WHITE PAPER. Header Title. Side Bar Copy. Header Title 5 Reasons to Consider Disaster Recovery as a Service for IBM i WHITEPAPER

WHITE PAPER. Header Title. Side Bar Copy. Header Title 5 Reasons to Consider Disaster Recovery as a Service for IBM i WHITEPAPER Side Bar Copy Header Title Header Title 5 Reasons to Consider Disaster Recovery as a Service for IBM i WHITEPAPER Introduction Due to the complexity of protecting ever-changing infrastructures and the

More information

BEAWebLogic. Server. Automatic and Manual Service-level Migration

BEAWebLogic. Server. Automatic and Manual Service-level Migration BEAWebLogic Server Automatic and Manual Service-level Migration Version 10.3 Technical Preview Revised: March 2007 Service-Level Migration New in WebLogic Server 10.3: Automatic Migration of Messaging/JMS-Related

More information

CLUSTERING. What is Clustering?

CLUSTERING. What is Clustering? What is Clustering? CLUSTERING A cluster is a group of independent computer systems, referred to as nodes, working together as a unified computing resource. A cluster provides a single name for clients

More information

VERITAS Dynamic MultiPathing (DMP) Increasing the Availability and Performance of the Data Path

VERITAS Dynamic MultiPathing (DMP) Increasing the Availability and Performance of the Data Path White Paper VERITAS Storage Foundation for Windows VERITAS Dynamic MultiPathing (DMP) Increasing the Availability and Performance of the Data Path 12/6/2004 1 Introduction...3 Dynamic MultiPathing (DMP)...3

More information

IBM Tivoli Storage Manager Version Introduction to Data Protection Solutions IBM

IBM Tivoli Storage Manager Version Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.6 Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.6 Introduction to Data Protection Solutions IBM Note: Before you use this

More information

VERITAS Dynamic Multipathing. Increasing the Availability and Performance of the Data Path

VERITAS Dynamic Multipathing. Increasing the Availability and Performance of the Data Path VERITAS Dynamic Multipathing Increasing the Availability and Performance of the Data Path 1 TABLE OF CONTENTS I/O Path Availability and Performance... 3 Dynamic Multipathing... 3 VERITAS Storage Foundation

More information

Building a 24x7 Database. By Eyal Aronoff

Building a 24x7 Database. By Eyal Aronoff Building a 24x7 Database By Eyal Aronoff Contents Building a 24 X 7 Database... 3 The Risk of Downtime... 3 Your Definition of 24x7... 3 Performance s Impact on Availability... 4 Redundancy is the Key

More information

Three Steps Toward Zero Downtime. Guide. Solution Guide Server.

Three Steps Toward Zero Downtime. Guide. Solution Guide Server. Three Steps Toward Zero Downtime Guide Solution Guide Server Server Solution Guide Three Steps Toward Zero Downtime Introduction Service uptime is a top priority for many business operations. From global

More information

IBM TotalStorage Enterprise Storage Server (ESS) Model 750

IBM TotalStorage Enterprise Storage Server (ESS) Model 750 A resilient enterprise disk storage system at midrange prices IBM TotalStorage Enterprise Storage Server (ESS) Model 750 Conducting business in the on demand era demands fast, reliable access to information

More information

ExpressCluster X 1.0 for Windows

ExpressCluster X 1.0 for Windows ExpressCluster X 1.0 for Windows Getting Started Guide 6/22/2007 Third Edition Revision History Edition Revised Date Description First 09/08/2006 New manual Second 12/28/2006 Reflected the logo change

More information

A GPFS Primer October 2005

A GPFS Primer October 2005 A Primer October 2005 Overview This paper describes (General Parallel File System) Version 2, Release 3 for AIX 5L and Linux. It provides an overview of key concepts which should be understood by those

More information

DISK LIBRARY FOR MAINFRAME

DISK LIBRARY FOR MAINFRAME DISK LIBRARY FOR MAINFRAME Geographically Dispersed Disaster Restart Tape ABSTRACT Disk Library for mainframe is Dell EMC s industry leading virtual tape library for IBM zseries mainframes. Geographically

More information

Advanced Architectures for Oracle Database on Amazon EC2

Advanced Architectures for Oracle Database on Amazon EC2 Advanced Architectures for Oracle Database on Amazon EC2 Abdul Sathar Sait Jinyoung Jung Amazon Web Services November 2014 Last update: April 2016 Contents Abstract 2 Introduction 3 Oracle Database Editions

More information

Exam : S Title : Snia Storage Network Management/Administration. Version : Demo

Exam : S Title : Snia Storage Network Management/Administration. Version : Demo Exam : S10-200 Title : Snia Storage Network Management/Administration Version : Demo 1. A SAN architect is asked to implement an infrastructure for a production and a test environment using Fibre Channel

More information

WebSphere MQ Queue Sharing Group in a Parallel Sysplex environment

WebSphere MQ Queue Sharing Group in a Parallel Sysplex environment Draft Document for Review January 14, 2004 11:55 am 3636paper.fm Redbooks Paper Mayur Raja Amardeep Bhattal Pete Siddall Edited by Franck Injey WebSphere MQ Queue Sharing Group in a Parallel Sysplex environment

More information

IBM MQ Appliance Performance Report Version June 2015

IBM MQ Appliance Performance Report Version June 2015 IBM MQ Appliance Performance Report Version 1. - June 215 Sam Massey IBM MQ Performance IBM UK Laboratories Hursley Park Winchester Hampshire 1 Notices Please take Note! Before using this report, please

More information

Virtual Disaster Recovery

Virtual Disaster Recovery The Essentials Series: Managing Workloads in a Virtual Environment Virtual Disaster Recovery sponsored by by Jaime Halscott Vir tual Disaster Recovery... 1 Virtual Versus Physical Disaster Recovery...

More information

WebSphere MQ and OpenVMS Failover Sets

WebSphere MQ and OpenVMS Failover Sets OpenVMS Technical Journal V11 WebSphere MQ and OpenVMS Sets John Edelmann, Technology Consultant WebSphere MQ and OpenVMS Sets... 1 Overview... 2 WebSphere MQ Clusters... 2 OpenVMS Clusters and WebSphere

More information

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Distributed Systems. Fault Tolerance. Paul Krzyzanowski Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected

More information

IBM Software Group. IBM WebSphere MQ V7.0. Introduction and Technical Overview. An IBM Proof of Technology IBM Corporation

IBM Software Group. IBM WebSphere MQ V7.0. Introduction and Technical Overview. An IBM Proof of Technology IBM Corporation IBM Software Group IBM WebSphere MQ V7.0 Introduction and Technical Overview An IBM Proof of Technology 2008 IBM Corporation Unit Agenda Why is Messaging Important to the Enterprise? What is WebSphere

More information

Important Announcement: Substantial Upcoming Enhancement to Mirroring. Change Required for Sites Currently Using IsOtherNodeDown^ZMIRROR

Important Announcement: Substantial Upcoming Enhancement to Mirroring. Change Required for Sites Currently Using IsOtherNodeDown^ZMIRROR One Memorial Drive, Cambridge, MA 02142, USA Tel: +1.617.621.0600 Fax: +1.617.494.1631 http://www.intersystems.com January 30, 2014 Important Announcement: Substantial Upcoming Enhancement to Mirroring

More information

Contingency Planning and Disaster Recovery

Contingency Planning and Disaster Recovery Contingency Planning and Disaster Recovery Best Practices Version: 7.2.x Written by: Product Knowledge, R&D Date: April 2017 2017 Lexmark. All rights reserved. Lexmark is a trademark of Lexmark International

More information

Microsoft SQL Server

Microsoft SQL Server Microsoft SQL Server Abstract This white paper outlines the best practices for Microsoft SQL Server Failover Cluster Instance data protection with Cohesity DataPlatform. December 2017 Table of Contents

More information

Database Architectures

Database Architectures Database Architectures CPS352: Database Systems Simon Miner Gordon College Last Revised: 11/15/12 Agenda Check-in Centralized and Client-Server Models Parallelism Distributed Databases Homework 6 Check-in

More information

The Microsoft Large Mailbox Vision

The Microsoft Large Mailbox Vision WHITE PAPER The Microsoft Large Mailbox Vision Giving users large mailboxes without breaking your budget Introduction Giving your users the ability to store more email has many advantages. Large mailboxes

More information

HA-AP Hardware Appliance

HA-AP Hardware Appliance HA-AP Hardware Appliance Solution Whitepaper: High Availability SAN Appliance - Guarantees Data Access and Complete Transparency November 2013 Loxoll Inc. California U.S.A. Protection against workflow

More information

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Java Message Service (JMS) is a standardized messaging interface that has become a pervasive part of the IT landscape

More information

TECHNICAL ADDENDUM 01

TECHNICAL ADDENDUM 01 TECHNICAL ADDENDUM 01 What Does An HA Environment Look Like? An HA environment will have a Source system that the database changes will be captured on and generate local journal entries. The journal entries

More information

EMC CLARiiON CX3-40. Reference Architecture. Enterprise Solutions for Microsoft Exchange 2007

EMC CLARiiON CX3-40. Reference Architecture. Enterprise Solutions for Microsoft Exchange 2007 Enterprise Solutions for Microsoft Exchange 2007 EMC CLARiiON CX3-40 Metropolitan Exchange Recovery (MER) for Exchange Server Enabled by MirrorView/S and Replication Manager Reference Architecture EMC

More information

Symantec Storage Foundation for Oracle Real Application Clusters (RAC)

Symantec Storage Foundation for Oracle Real Application Clusters (RAC) Symantec Storage Foundation for Oracle Real Application Clusters () Manageability and availability for Oracle databases Data Sheet: Storage Management Over Overview view Key Benefits SymantecTM Storage

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

3.1. Storage. Direct Attached Storage (DAS)

3.1. Storage. Direct Attached Storage (DAS) 3.1. Storage Data storage and access is a primary function of a network and selection of the right storage strategy is critical. The following table describes the options for server and network storage.

More information

HIGH AVAILABILITY STRATEGIES

HIGH AVAILABILITY STRATEGIES An InterSystems Technology Guide One Memorial Drive, Cambridge, MA 02142, USA Tel: +1.617.621.0600 Fax: +1.617.494.1631 http://www.intersystems.com HIGH AVAILABILITY STRATEGIES HA Strategies for InterSystems

More information

INTRODUCING VERITAS BACKUP EXEC SUITE

INTRODUCING VERITAS BACKUP EXEC SUITE INTRODUCING VERITAS BACKUP EXEC SUITE January 6, 2005 VERITAS ARCHITECT NETWORK TABLE OF CONTENTS Managing More Storage with Fewer Resources...3 VERITAS Backup Exec Suite...3 Continuous Data Protection...

More information

Protecting remote site data SvSAN clustering - failure scenarios

Protecting remote site data SvSAN clustering - failure scenarios White paper Protecting remote site data SvSN clustering - failure scenarios Service availability and data integrity are key metrics for enterprises that run business critical applications at multiple remote

More information

White Paper. How to select a cloud disaster recovery method that meets your requirements.

White Paper. How to select a cloud disaster recovery method that meets your requirements. How to select a cloud disaster recovery method that meets your requirements. VS Table of contents Table of contents Page 2 Executive Summary Page 3 Introduction Page 3 Disaster Recovery Methodologies Page

More information

Introduction and Technical Overview

Introduction and Technical Overview IBM Software Group IBM WebSphere MQ V7.0 Introduction and Technical Overview An IBM Proof of Technology 2008 IBM Corporation Unit Agenda Why is Messaging Important to the Enterprise? What is WebSphere

More information

IBM TS7700 grid solutions for business continuity

IBM TS7700 grid solutions for business continuity IBM grid solutions for business continuity Enhance data protection and business continuity for mainframe environments in the cloud era Highlights Help ensure business continuity with advanced features

More information