
EMC Business Continuity for Microsoft Exchange 2010
Enabled by EMC Unified Storage and Microsoft Database Availability Groups

Proven Solution Guide

Copyright 2011 EMC Corporation. All rights reserved.

Published January 2011

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks used herein are the property of their respective owners.

Part number: h8154

Table of Contents

Chapter 1: About this Document
  Overview
  Audience and purpose
  Scope
  Reference architecture
  Prerequisites and supporting documentation
  Terminology

Chapter 2: Application Design
  Overview
  Microsoft Exchange Server 2010
  Replication Manager
  Best practices and recommendations

Chapter 3: Virtualization
  Overview
  Concepts
  Advantages of virtualization
  Considerations
  Implementation
  Virtualization best practices

Chapter 4: Network Design
  Overview
  Considerations
  Implementation
  Best practices

Chapter 5: Storage Design
  Overview
  Design considerations
  Storage design implementation
  Best practices

Chapter 6: Testing and Validation
  Overview
  Test details
    Testing overview
    Testing tools
    Gating metrics
    Methodology
  Result analysis
    Overview
    Comparison between RAID 5, RAID 6, and RAID 10
    Scale up test results with RAID 5 (4+1) configuration
    Resiliency scenarios
    Exchange 2010 DAG test

Chapter 1: About this Document

Overview

Introduction

EMC's commitment to consistently maintain and improve quality is led by the Total Customer Experience (TCE) program, which is driven by Six Sigma methodologies. As a result, EMC has built Customer Integration Labs in its Global Solutions Centers to reflect real-world deployments in which TCE use cases are developed and executed. These use cases provide EMC with insight into the challenges currently facing its customers.

This Proven Solution Guide summarizes a series of best practices that were discovered or validated during testing of the EMC Business Continuity for Microsoft Exchange 2010 Enabled by EMC Unified Storage and Microsoft Database Availability Groups solution, which uses the following products:

- EMC unified storage
- Microsoft Exchange Server 2010
- Microsoft Windows Server 2008 R2 Hyper-V
- EMC Replication Manager 5.3
- EMC PowerPath

This solution was implemented in a virtualized environment to leverage the benefits of virtualization and to consolidate the customer's hardware infrastructure.

Use case definition

A use case reflects a defined set of tests that validates the reference architecture for a customer environment. This validated architecture can then be used as a reference point for a proven solution.

Contents

This chapter contains the following topics:

- Audience and purpose
- Scope
- Reference architecture
- Prerequisites and supporting documentation
- Terminology

Audience and purpose

Audience

The intended audience for this Proven Solution Guide is:

- Customers
- EMC partners
- Internal EMC personnel

Purpose

The purpose of this use case is to provide a virtualized solution for Microsoft Exchange Server 2010 using Microsoft Hyper-V. The solution includes all the components required to run this environment, that is, hardware and software, including Active Directory and the required Exchange Server roles. The solution also uses EMC Replication Manager and EMC SnapView for backup.

Information in this document can be used as the basis for a solution build, white paper, best practices document, or training. It can also be used by other EMC organizations (for example, the technical services or sales organization) as the basis for producing documentation for a technical services or sales kit.

Scope

This document contains the results of testing Microsoft Exchange Server 2010 with Database Availability Groups (DAGs) in a Microsoft Hyper-V virtual environment on EMC unified storage. The objectives of this testing were as follows:

- Establish a reference architecture of validated hardware and software that permits easy and repeatable deployment of Microsoft Exchange 2010 on virtual machines using EMC unified storage.
- Establish storage best practices for configuring Microsoft Exchange Server 2010 with DAG on virtual machines using EMC unified storage in a manner that provides optimal performance, recoverability, and protection, all in the context of the midtier enterprise market.

The following use cases were tested to establish this reference architecture:

- Examine potential building block configurations on SATA and select a suitable building block.
- Scale up the building block to determine the maximum user count supported in the environment.
- Determine how the system responds to recoverable failure conditions by conducting resiliency tests.
- Validate the use of DAG replication with EMC CLARiiON CX4-120 for disaster recovery in Microsoft Exchange 2010.
- Examine the performance impact of WAN latency on DAG replication.
- Examine the backup of Exchange data with EMC unified storage snapshots by using Replication Manager.

Not in scope

Implementation instructions and sizing guidelines are beyond the scope of this document, as is information on how to install and configure Microsoft Exchange Server 2010 and the required EMC products. However, links are provided to all required software for this solution.

Reference architecture

Corresponding reference architecture

This use case has a corresponding Reference Architecture document that is available on EMC Powerlink and EMC.com. EMC Business Continuity for Microsoft Exchange 2010 Enabled by EMC Unified Storage and Microsoft Database Availability Groups Reference Architecture provides more details. If you do not have access to this content, contact your EMC representative.

Reference architecture diagram

The following diagram depicts the overall logical architecture of the use case. This solution can also be built in a physical environment, where the functionality is identical to a virtualized Exchange 2010 DAG environment.

Prerequisites and supporting documentation

Technology

It is assumed that the reader has a general knowledge of the following products:

- Microsoft Windows Server 2008 R2 Hyper-V
- Microsoft Exchange Server 2010
- EMC unified storage
- EMC Replication Manager 5.3
- EMC PowerPath

Supporting documents

The following documents, located on Powerlink.com, provide additional, relevant information. Access to these documents is based on your login credentials. If you do not have access to this content, contact your EMC representative.

- EMC Business Continuity for Microsoft Exchange 2010 Enabled by EMC Unified Storage and Microsoft Database Availability Groups Reference Architecture
- Using EMC CLARiiON with Microsoft Hyper-V Server Applied Technology white paper
- Deployment Guidelines for Microsoft Exchange 2010 with EMC Unified Storage Best Practices Planning white paper

Terminology

Introduction

This section defines the terms used in this document.

- Exchange 2010 Database Availability Group (DAG): A set of up to 16 Microsoft Exchange Server 2010 mailbox servers that provides automatic database-level recovery from a database, server, or network failure.
- Internet SCSI (iSCSI): A protocol for sending SCSI packets over TCP/IP networks.
- Microsoft Exchange Server 2010: A unified messaging solution from Microsoft Corporation that is the target application for this testing.
- Redundant Array of Inexpensive Disks (RAID): A method of storing data on multiple disk drives to increase performance and storage capacity and to provide redundancy and fault tolerance.

Chapter 2: Application Design

Overview

Introduction

The primary application in this solution is Microsoft Exchange Server 2010. In addition, the solution uses the following supporting applications:

- EMC Replication Manager
- EMC PowerPath

Designing and implementing the layout for any environment is critical, because correcting layout errors can be expensive and time-consuming. Getting it right the first time should therefore be the primary goal. This chapter explains the design process employed in building the Exchange 2010 DAG infrastructure for this use case. The information provided here can be used as a starting point to design and implement a similar environment.

This document does not provide planning details and architecture guidelines for an Exchange 2010 DAG environment. The following link provides more information about Exchange 2010 DAG planning and architecture:

http://technet.microsoft.com/en-us/library/dd979781.aspx

Scope

The application design layout instructions presented in this chapter apply to the specific components used during the development of this solution.

Contents

This chapter contains the following topics:

- Microsoft Exchange Server 2010
- Replication Manager
- Best practices and recommendations

Microsoft Exchange Server 2010

Considerations

Before any of the use cases can be realized, Exchange 2010 DAG must be set up to create the database copies. Proper design is often overlooked in the rush to attain a replicated state for regulatory or business reasons, but a well-planned setup is critical to successfully addressing the use cases for the database copies. When building a Microsoft Exchange Server 2010 environment, consider the following:

- Architectural changes: Microsoft has introduced DAG in place of Local Continuous Replication (LCR), Cluster Continuous Replication (CCR), and Standby Continuous Replication (SCR). The following link provides more information: http://technet.microsoft.com/en-us/library/dd335211(exchg.140).aspx
- Increase in checkpoint depth: For DAG configurations, Microsoft has increased the checkpoint depth to 100 MB per database, which further reduces the per-user I/O requirement. The following link provides more information: http://technet.microsoft.com/en-us/library/ee832793.aspx
- Workload of the target application: When designing a real-world solution, it is crucial to understand the workload of the target application and how it relates to the industry-standard workload presented in this solution.
- Exchange Server design: Exchange Server 2010 requires different server roles to be installed on different physical or virtual servers, and organizations may want multiple servers for the same role. When designing this solution, create a typical environment that applies to most requirements and is scalable in terms of performance.
- Database maintenance: Databases require regular maintenance for optimal performance. As new mailboxes are added to the database and old mailboxes are deleted, the Exchange database must be administered and maintained properly. Regular maintenance prevents the natural growth of a database from having a detrimental impact on the planned performance of the system.

Implementation

In this use case, the Exchange 2010 environment was designed for high capacity with optimized performance, reduced bottlenecks, and ease of manageability. The Exchange 2010 environment consisted of:

- Three Exchange 2010 mailbox servers for the DAG configuration.
- Three Exchange 2010 servers with the HUB Transport and CAS roles.

Microsoft states that in a DAG environment, there must be at least two database copies if the disks have RAID protection. EMC supports Microsoft's position of having two copies locally, and also recommends having one remote copy for disaster recovery when Exchange 2010 is deployed in an EMC unified storage environment with RAID protection.

This solution consisted of an Exchange 2010 DAG implementation with two local database copies on the primary CLARiiON storage array. The production site mailbox servers hosted the active copies of the databases, which were evenly distributed between the two production site mailbox servers. The single remote copy of the Exchange database, which resided on a secondary CLARiiON storage array at the disaster recovery site, was used for disaster recovery.

All the roles in the Exchange 2010 environment were deployed on Hyper-V virtual machines hosted on physical machines with Windows 2008 R2 installed. The primary hypervisor hosted the virtual machines for the first Exchange mailbox server and one HUB/CAS server. The secondary hypervisor hosted the virtual machines for the second HUB/CAS server and the second mailbox server. The third hypervisor hosted the virtual machines for the second domain controller, another HUB/CAS server, and the third Exchange mailbox server. The primary Active Directory server was hosted on a physical machine.

To simulate a remote disaster recovery site, a WAN emulator was used with a suitable latency and bandwidth. The solution also studied the impact of WAN latencies on the DAG implementation.

The following table describes the individual server roles in the Exchange 2010 environment created for this use case.

Server role: Exchange 2010 mailbox server
Configuration details:
- A virtual machine deployed on a Windows 2008 R2 Hyper-V physical server.
- Windows Server 2008 R2 installed on the virtual machine.
- Exchange Server 2010 mailbox server role installed on the virtual machine.
- Two mailbox servers on the primary site and one on the disaster recovery site.
- EMC PowerPath configured on these servers for high availability and load balancing of the iSCSI storage connections.

Server role: Exchange 2010 HUB Transport and CAS
Configuration details:
- A virtual machine deployed on a Windows 2008 R2 Hyper-V physical server.
- Windows Server 2008 R2 installed on the virtual machine.
- Exchange Server 2010 HUB Transport and CAS roles installed on the virtual machine.
- One server on the primary site and one on the disaster recovery site.
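To make the copy layout concrete, the following is a minimal illustrative sketch (not output from any Exchange tool) of the distribution described above: active copies split evenly across the two production mailbox servers, a local passive copy on the peer server, and a third copy on the disaster recovery mailbox server. The database names and the server names MBX1 through MBX3 are hypothetical.

```python
# Illustrative model of the DAG copy layout described above: six databases
# (two per building block across three building blocks, per Chapter 5),
# active copies split evenly between the two production mailbox servers,
# one local passive copy, and one remote copy at the disaster recovery site.
# Server names MBX1-MBX3 and database names are hypothetical.

databases = [f"DB{n}" for n in range(1, 7)]
production = ["MBX1", "MBX2"]
dr_server = "MBX3"

layout = {}
for i, db in enumerate(databases):
    active = production[i % 2]           # even distribution of active copies
    passive = production[(i + 1) % 2]    # local passive copy on the peer server
    layout[db] = (active, passive, dr_server)

for db, (active, passive, dr) in layout.items():
    print(f"{db}: active on {active}, local passive on {passive}, DR copy on {dr}")
```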

Further, this solution ensures that a backup of the Exchange databases is taken in the form of point-in-time copies by using EMC SnapView snapshots with EMC Replication Manager. The impact of the backup on the production system was studied.

Replication Manager

Considerations

EMC Replication Manager is replica automation software that creates point-in-time replicas of databases and file systems residing on CLARiiON, Symmetrix, or Celerra storage systems. Replicas can be used for repurposing, backup and recovery, or disaster recovery. Replication Manager provides a single interface to manage local and remote replicas across supported storage systems.

Replication Manager integrates well with Microsoft Exchange Server 2010. It discovers and maps databases in an Exchange Server 2010 DAG or stand-alone environment to the underlying storage infrastructure. In this solution, Replication Manager is used along with CLARiiON SnapView snapshot technology to create point-in-time, application-aware snapshots of Exchange Server 2010 databases.

Replication Manager supports taking snapshots from both active and passive databases in a DAG environment; consider this when designing the environment. The snapshots taken by Replication Manager must be mounted to take the backup, so decide in advance which server will act as the mount host for the snapshots.

Implementation

The Replication Manager server was installed on a virtual machine with Microsoft Windows Server 2008 R2 Enterprise edition. The Replication Manager Exchange 2010 agent was installed on the mailbox servers in the production site of the DAG. Replication Manager was used to take snapshots of the passive Exchange database and log LUNs and mount them on the same server.

Best practices and recommendations

The Deployment Guidelines for Microsoft Exchange 2010 with EMC Unified Storage Best Practices Planning white paper, available on Powerlink, provides a list of high-availability best practices.

Chapter 3: Virtualization

Overview

Introduction

This chapter provides procedures and guidelines to install and configure the virtualization components that make up the validated solution.

Scope

The virtualization guidelines presented in this chapter apply to the specific components used during the development of this solution.

Contents

This chapter contains the following topics:

- Concepts
- Advantages of virtualization
- Considerations
- Implementation
- Virtualization best practices

Concepts

Virtualization layer

The virtualization layer abstracts the processor, memory, storage, and network resources of a physical server to multiple virtual machines. This allows multiple operating systems to run simultaneously and independently on a single physical server.

Hyper-V

Hyper-V, the virtualization technology in Microsoft Windows Server 2008 R2, reduces hardware investment by allowing multiple virtual machines to run on a single physical machine. Hyper-V is a hypervisor-based virtualization system for x86-64 systems.

Advantages of virtualization

- Reduced costs: One of the main challenges customers face is reducing costs by using infrastructure effectively. Virtualization reduces the number of servers and related IT hardware in the data center.
- Reduced downtime: A running virtual machine production database can be moved from one physical server to another with reduced downtime.
- Superior performance and scalability: In a scale-out context, virtualization can provide superior performance and scalability compared to physically booted configurations, even on identical hardware.
- Ease of use: A single user interface allows administrators to manage and monitor multiple virtual machines from one console, so virtual machines can be managed more easily and conveniently than physical servers.

Considerations

Hosting multiple virtual servers

It is important to consider all aspects of the solution before virtualizing the environment. When hosting multiple virtual servers on a single physical server, consider the following:

- Do not host all high-priority virtual servers on a single physical server.
- Do not host processor- and memory-intensive virtual machines on a single physical server.
- Do not overprovision processor and memory for high-priority virtual servers.

Implementation

Implementation of Hyper-V

This solution is virtualized by using Microsoft Windows Server 2008 R2 Hyper-V. It uses three physical servers with Windows Server 2008 R2 installed. Two of the Hyper-V servers constitute the primary site and one constitutes the disaster recovery site. This section provides the details of the virtualization design of this solution.

Each of the three physical servers has the following characteristics:

- Four 3 GHz Intel Xeon processors
- 32 GB of memory
- One 73 GB 15k rpm internal SCSI disk
- Two on-board 10/100/1000 Mb Ethernet NICs
- Four additional 10/100/1000 Mb Ethernet NICs

The following table provides the configuration of the Hyper-V systems. All three hosts are Dell PowerEdge 2950 servers.

Hyper-V 1:
- Exchange mailbox server: four vCPUs, 16 GB of memory, four virtual NICs
- Replication Manager: two vCPUs, 4 GB of memory, one virtual NIC

Hyper-V 2:
- Exchange mailbox server: four vCPUs, 16 GB of memory, four virtual NICs
- Exchange HUB/CAS: two vCPUs, 4 GB of memory, one virtual NIC

Hyper-V 3:
- Exchange mailbox server: four vCPUs, 16 GB of memory, four virtual NICs
- Exchange HUB/CAS: two vCPUs, 4 GB of memory, one virtual NIC
- Active Directory and global catalog server: two vCPUs, 4 GB of memory, one virtual NIC

This hardware configuration was based on Microsoft's recommendations for the different Exchange Server 2010 roles, balanced against optimum performance and cost. The following links provide more details about selecting the hardware configuration for the various Exchange Server 2010 roles:

http://technet.microsoft.com/en-us/library/aa996719.aspx
http://technet.microsoft.com/en-us/library/dd346700.aspx
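As a quick sanity check on these allocations, the sketch below totals the vCPU and memory assignments per hypervisor against the physical resources listed earlier. This is a minimal illustration using only the figures from this section; it is not a tool used in the validation.

```python
# Sanity-check VM resource allocations against each Hyper-V host's physical
# capacity (four CPUs and 32 GB of RAM per Dell PowerEdge 2950).
# All figures are copied from the configuration tables in this chapter.

HOST_CORES = 4
HOST_MEM_GB = 32

# (VM name, vCPUs, memory in GB), grouped by hypervisor.
hosts = {
    "Hyper-V 1": [("Exchange mailbox server", 4, 16), ("Replication Manager", 2, 4)],
    "Hyper-V 2": [("Exchange mailbox server", 4, 16), ("Exchange HUB/CAS", 2, 4)],
    "Hyper-V 3": [("Exchange mailbox server", 4, 16), ("Exchange HUB/CAS", 2, 4),
                  ("AD/global catalog", 2, 4)],
}

for host, vms in hosts.items():
    vcpus = sum(v for _, v, _ in vms)
    mem = sum(m for _, _, m in vms)
    # Hyper-V tolerates vCPU overcommit; memory should stay within physical RAM.
    flag = "  <-- memory overcommitted" if mem > HOST_MEM_GB else ""
    print(f"{host}: {vcpus} vCPUs ({vcpus / HOST_CORES:.1f}:1 ratio), "
          f"{mem}/{HOST_MEM_GB} GB memory{flag}")
```

With the figures above, each host stays within its 32 GB of physical memory (20 to 24 GB allocated) while running a modest vCPU overcommit ratio.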

Virtualization best practices

The Deployment Guidelines for Microsoft Exchange 2010 with EMC Unified Storage Best Practices Planning white paper, available on Powerlink, provides a list of virtualization best practices for Exchange.

Chapter 4: Network Design

Overview

Introduction

This chapter describes the network architecture of the CLARiiON CX4-120 in the Exchange 2010 DAG with Microsoft Hyper-V solution.

Scope

System-wide network design and architecture are outside the scope of this solution. This chapter presents network design recommendations that are consistent with industry-accepted best practices and are compatible with the existing network infrastructure and policies.

Contents

This chapter contains the following topics:

- Considerations
- Implementation
- Best practices

Considerations

Physical design considerations

EMC recommends switches that support gigabit Ethernet (GbE) connections with ports that support copper-based media. EMC also recommends switches that support the fastest and most reliable connection methods currently available: copper Gigabit Ethernet for client networks and IP storage networks, and dedicated 4 Gb optical Fibre Channel for storage networks.

To ensure uninterrupted communication between systems and storage in the environment, plan the networks for high availability. This includes redundant switches and paths as well as redundant network interface cards (NICs) or NIC ports. EMC PowerPath can also be used for high availability and load balancing.

Logical design considerations

This validated solution uses virtual LANs (VLANs) to segregate network traffic of different types to improve throughput, manageability, application separation, high availability, and security. The four VLANs used in this solution are:

- A client VLAN that supports connectivity between the application servers and the client workstations. The client VLAN also supports connectivity between EMC unified storage and the client workstations to provide network file services to the clients.
- A storage VLAN that uses the iSCSI protocol to provide connectivity between the servers and the storage. Each server is connected to the storage VLAN and has at least one NIC dedicated to storage traffic.
- A management VLAN that supports connectivity to the virtual servers for server administration and to the EMC unified storage system for storage management.
- A replication VLAN for Exchange DAG replication between the Exchange mailbox servers.

Implementation

Physical design implementation

The CLARiiON CX4-120 contains two storage processors (SPs) that can operate independently. Each SP has five I/O slots, two of which are used for iSCSI connectivity in this solution. The following diagram depicts the I/O module configuration of the CX4-120 SPs.

Logical design implementation

Two ports in slots A0 and A1 handled the storage traffic on the storage VLAN for the disks owned by SP A. All other ports were left open for future requirements. A similar configuration was used for the disks owned by SP B. EMC PowerPath was used for multipathing and load balancing from the host side.

Best practices

The Deployment Guidelines for Microsoft Exchange 2010 with EMC Unified Storage Best Practices Planning white paper, available on Powerlink, provides a list of network-related best practices.

Chapter 5: Storage Design

Overview

Introduction

Storage design is an important element in ensuring the successful development of this solution.

Scope

The storage design layout instructions presented in this chapter apply to the specific components used during the development of this solution.

Contents

This chapter contains the following topics:

- Design considerations
- Storage design implementation
- Best practices

Design considerations

Overview

The most common mistake when planning storage is to design only for storage capacity and not for performance, or I/Os per second (IOPS). To plan an efficient disk layout, calculate the number of IOPS that must be supported on a sustained basis from the transactional requirements of the user profile, the peak IOPS, and the peak duration.

Performance

Many customers gather data while the application is running and then use the 90th percentile to determine the performance level to plan for. The following user profile and mailbox requirement were considered for the storage calculation:

- Number of users: 1,500
- User profile: 100 messages sent/received per day
- IOPS per user: 0.12

The following four primary variables were used to determine the number of spindles for storage:

- IOPS (or MB/s if it is a sequential workload).
- Latency goals based on the application requirements.
- RAID level: When planning for performance, striped RAID 1/0 requires fewer spindles than RAID 5 for almost all read/write workloads. The spindle count is approximately equal in a read-only workload.
- Drive type: The drive type can dramatically decrease or increase the number of drives required to satisfy the workload. As a general best practice, database-type applications are hosted on Fibre Channel (FC) drives. However, SATA is becoming a more popular and efficient choice for Exchange 2010 data storage because of the IOPS reductions implemented in the product.

The storage configuration was decided based on the results of Jetstress tests on various storage configurations: RAID 6 (4+2), RAID 5 (4+1), and RAID 1/0 (4+4). RAID 5 (4+1) gave the best support for 500 users with the minimum storage configuration for Exchange 2010 DAG. The user calculation was then done with 20 percent overhead and 25 percent growth headroom. Result analysis provides more information.

Capacity

With major advances in disk technology, the increase in the storage capacity of a disk drive has outpaced the increase in IOPS by almost 1,000:1. As a result, it is rare to find a system that does not meet the storage capacity requirements for the workload, so IOPS capacity must be used as the standard when planning storage configurations. Storage capacity (GB) must be considered only after the IOPS capacity of a configuration.

With Exchange 2010, Microsoft has changed the IOPS requirements and design considerations, and the storage capacity of the drives is now as important as the IOPS capacity. With DAG, customers may not need a separate replication technology, which eliminates the performance bottleneck caused by replication. The mailbox size requirement has increased, and the database LUN size requirement has also increased because of the introduction of DAG and the change in the mailbox schema to accommodate larger mailboxes. So, when designing the storage for Exchange 2010 DAG, both the storage capacity and the IOPS capacity must be given equal importance.
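To make the IOPS-first sizing concrete, here is a minimal sketch of the arithmetic: host IOPS are derived from the user profile above, then translated into back-end disk IOPS using the standard RAID write penalties. The per-disk IOPS figure and the read/write mix are illustrative assumptions, not values taken from this testing.

```python
import math

# Rough spindle estimate from transactional IOPS, illustrating why IOPS
# (not raw capacity) usually drives the disk count. The user count and
# per-user IOPS come from the profile above; the per-disk IOPS and the
# read/write mix are generic assumptions, not measurements from this test.

USERS = 1500
IOPS_PER_USER = 0.12      # user profile from the table above
DISK_IOPS = 80            # assumed capability of one 7.2k rpm SATA spindle
READ_FRACTION = 2 / 3     # assumed read share of the workload

WRITE_PENALTY = {"RAID 1/0": 2, "RAID 5": 4, "RAID 6": 6}  # standard values

host_iops = USERS * IOPS_PER_USER   # 1,500 x 0.12 = 180 host IOPS
for raid, penalty in WRITE_PENALTY.items():
    # Each read costs one disk I/O; each host write costs `penalty` disk I/Os.
    backend = host_iops * READ_FRACTION + host_iops * (1 - READ_FRACTION) * penalty
    print(f"{raid}: {backend:.0f} back-end IOPS -> "
          f"~{math.ceil(backend / DISK_IOPS)} spindles")
```

Under these assumptions the RAID 5 case lands at roughly five spindles, which is at least consistent with the RAID 5 (4+1) building block selected later in this chapter.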

NOTE: The storage layout implementation was calculated from the Jetstress test results on various storage configurations. Result analysis provides more information on how the building block was decided.

Storage design implementation

Introduction

For this testing, EMC unified storage was used with a RAID 5 (4+1) disk configuration. The building block for the DAG testing was determined by testing various building block configurations: RAID 5, RAID 6, and RAID 10. An ideal building block was determined based on the performance and storage capacity of the various configurations.

Even though Microsoft does not recommend using RAID 5 with SATA disks, this configuration was selected for this solution because it was a DAG configuration, and the RAID rebuild time was not considered a crucial design issue when compared to other important design factors such as cost and performance.

The following diagram depicts the overall storage layout of the solution.

Performance

For performance, all the LUNs were designed as metaLUNs. Each RAID group was sliced into four LUNs, and two metaLUNs were created by concatenating a faster LUN with a slower LUN to balance performance. The metaLUNs created on a single building block were balanced between the two storage processors by ensuring that the first metaLUN was owned by SP A and the second metaLUN was owned by SP B. The following diagrams show how the metaLUNs were created from a single RAID group and the resulting metaLUN configuration.

Capacity

The DAG copies were stored on 1 TB 7.2k rpm SATA drives. A single building block was defined as a RAID 5 (4+1) configuration that can store 500 users. Two databases and their logs were created on each building block. Three building blocks were used for each DAG copy. Two DAG copies were stored on the primary CLARiiON storage array, and the third copy was stored on the secondary CLARiiON storage array.

NOTE: A concatenated metaLUN configuration was the best practice at the time this solution was implemented. The current best practice is to use fully provisioned storage pool LUNs; compared to metaLUNs, storage pools make it easier to configure storage, especially for larger configurations.

Best practices

The Deployment Guidelines for Microsoft Exchange 2010 with EMC Unified Storage Best Practices Planning white paper, available on Powerlink, provides a list of storage best practices.
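Before moving on to testing, a quick capacity cross-check of the building block described above, written as a minimal sketch: it tests whether a RAID 5 (4+1) group of 1 TB drives can hold 500 users with 1 GB mailboxes once the 20 percent overhead and 25 percent growth figures from the design considerations are applied. The 0.90 formatted-capacity factor is an assumption for illustration.

```python
# Capacity cross-check for one building block: RAID 5 (4+1) on 1 TB SATA
# drives holding 500 users with 1 GB mailboxes. The 20% overhead and 25%
# growth figures come from the design considerations in this chapter;
# the 0.90 formatted-capacity factor is an illustrative assumption.

DISKS, DISK_TB, FORMAT_FACTOR = 5, 1.0, 0.90
usable_tb = (DISKS - 1) * DISK_TB * FORMAT_FACTOR   # RAID 5 gives one disk to parity

users, mailbox_gb = 500, 1.0
required_tb = users * mailbox_gb / 1024
required_tb *= 1.20    # 20 percent overhead
required_tb *= 1.25    # 25 percent growth headroom

print(f"Usable: {usable_tb:.2f} TB, required: {required_tb:.2f} TB -> "
      f"{'fits' if required_tb <= usable_tb else 'does not fit'}")
```

Consistent with the 1,000:1 capacity-to-IOPS observation earlier in this chapter, capacity is not the constraint here; the spindle count is driven by IOPS.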

Chapter 6: Testing and Validation

Overview

Introduction

This chapter outlines the test tools, methodology, workload, common setup procedures, architectural considerations, and test results.

Contents

This chapter contains the following topics:

- Test details
- Result analysis

Test details

Testing overview

Tested scenarios

The following scenarios were tested:

- Determine the optimal building block configuration for Exchange 2010.
- Scale up the selected building block to reach the maximum possible user count.
- Run resiliency tests on a stand-alone Exchange environment to determine how the loaded system responds to recoverable failure conditions. The tested scenarios include RAID group rebuild, network link failure, and SP reboot.
- Validate the use of DAG replication with the CX4-120 for disaster recovery in Microsoft Exchange 2010 for the following profile: 1,500 total users, a 1 GB mailbox per user, and an I/O profile of 100 messages sent/received per day (0.1 IOPS per user).
- Study the impact of WAN latency on DAG replication.
- Analyze the impact of snapshot and backup operations on the passive databases under I/O load by using Replication Manager and Windows native backup.

Testing tools

Introduction

To test the Exchange 2010 DAG environment, the following Microsoft tools were used:

- Microsoft Exchange Server Jetstress 2010
- Microsoft Exchange Load Generator (LoadGen) 2010

Microsoft Exchange Jetstress 2010

To verify the performance and stability of the disk subsystem, Microsoft Exchange Jetstress 2010 was used to simulate an Exchange I/O load on a test server before putting the server into the production environment.

Microsoft Exchange LoadGen 2010

Microsoft Exchange LoadGen was used to simulate Active Directory users sending messaging requests to the Exchange servers, and to measure the impact of MAPI, OWA, IMAP, POP, and SMTP clients on those servers. LoadGen tests how a server running Exchange responds to various email and messaging loads by generating multiple messaging requests to the Exchange servers, thereby inducing the specified mail load. LoadGen is a useful tool for administrators to validate the overall Exchange solution; specifically, it helps determine whether each server can handle the load it is intended to manage.

The following table provides the mailbox configuration and the load that were used for the test.

- Mailbox size: 1 GB
- Messages sent/received per day (approximately 75 KB message size): 100
- Exchange 2010 IOPS per user: 0.10
- Database cache per user: 6 MB

Gating metrics

Introduction

This section explains the gating metrics followed in this solution.

Overview

Disk latency and remote procedure call (RPC) parameters are the key gating metrics for this test. The following table describes them.

- MSExchangeIS\RPC Averaged Latency: The RPC latency, in milliseconds, averaged over the past 1,024 packets. Must not be higher than 25 ms on average.
- MSExchangeIS\RPC Requests: The number of client requests currently being processed by the Exchange information store. Must be below 70.
- MSExchange Database Instances(*)\I/O Database Reads Average Latency: The average time to read data from the database disk. The general threshold is below 20 ms.
- MSExchange Database Instances(*)\I/O Database Writes Average Latency: The average time to write data to the database disk. The general threshold is below 100 ms.
- MSExchange Database\I/O Log Writes Average Latency: The average time to write data to the log disk. The general threshold is below 10 ms.
- MSExchange Database\I/O Log Reads Average Latency: The average time to read data from the log file. This is specific to log replay and database recovery operations. The general threshold is below 200 ms.

The following link provides more information on performance counters:
http://technet.microsoft.com/en-us/library/dd335215.aspx
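A hedged sketch of how these gating metrics could be checked automatically. The abbreviated counter names and the thresholds come from the table above; the observed averages are hypothetical placeholders standing in for values that would normally be averaged from a Performance Monitor export.

```python
# Evaluate averaged Exchange counter values against the gating metrics above.
# `observed` is a hypothetical sample; in practice these averages would come
# from a Performance Monitor export. Counter names are abbreviated from the
# table in this section.

GATES = {
    r"MSExchangeIS\RPC Averaged Latency": 25,    # ms, on average
    r"MSExchangeIS\RPC Requests": 70,            # count
    r"I/O Database Reads Average Latency": 20,   # ms
    r"I/O Database Writes Average Latency": 100, # ms
    r"I/O Log Writes Average Latency": 10,       # ms
    r"I/O Log Reads Average Latency": 200,       # ms
}

observed = {  # hypothetical averages from a test run
    r"MSExchangeIS\RPC Averaged Latency": 7.2,
    r"MSExchangeIS\RPC Requests": 12,
    r"I/O Database Reads Average Latency": 14.8,
    r"I/O Database Writes Average Latency": 21.3,
    r"I/O Log Writes Average Latency": 2.1,
    r"I/O Log Reads Average Latency": 55.0,
}

for counter, limit in GATES.items():
    value = observed[counter]
    status = "PASS" if value < limit else "FAIL"
    print(f"{status}  {counter}: {value} (limit {limit})")
```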

Methodology

Introduction

This section explains the testing methodology. The high-level steps were:

- Examine potential building block configurations and select one based on the best performance and capacity.
- Scale up the environment by using multiple building blocks to reach the maximum possible user count.
- Use the selected building block for the Exchange DAG configuration and determine the baseline performance.
- Simulate a WAN delay in the DAG replication network and measure the resulting performance impact.
- Back up the Exchange data by mounting the CLARiiON snapshots taken using Replication Manager, and measure the Exchange performance.

Testing methodology 1

The Jetstress tool was used to determine the ideal building block configuration on SATA disk spindles and to estimate the number of Exchange users supported. The initial phase of testing compared the performance of RAID 5, RAID 10, and RAID 6. For each building block configuration in Jetstress, the thread count was varied to find the maximum IOPS supported by the configuration. The building block was chosen for delivering the maximum IOPS with an optimal number of disk spindles while keeping disk latencies within the gating metrics. The next phase of testing scaled up the selected building block step by step to determine the maximum supported user count. All the tests ran for two hours, and statistics were collected after the system reached a steady state.

Testing methodology 2

The Jetstress tests were followed by LoadGen tests that ran with the selected building block for 10 hours. Resiliency tests were performed to understand how the Exchange environment responds to failures while under I/O load. These tests were run on a stand-alone Exchange 2010 environment. The LoadGen tests included the following scenarios:

- SP failover during user load
- Network link failure during user load
- RAID group rebuild during user load

SP failover

To understand the impact of SP failover, multiple SP reboots were performed in the fifth hour and after the seventh hour of the LoadGen test.

Network link failure

During the LoadGen test, one of the iSCSI cables was pulled out at around the fourth hour of the test to study the impact on performance. The cable was plugged back in after two hours.

RAID group rebuild

To simulate the RAID group rebuild scenario, one of the disks in the RAID group that stored the database and log LUNs was pulled out after four hours of the LoadGen test. After a gap of three hours, the disk was placed back. To study the impact of the RAID group rebuild on performance, the LoadGen test was run in forever mode because of the time taken by the rebuild operation.

Testing methodology 3

This test checked for any performance impact of WAN delay on Exchange 2010 DAG. The Exchange DAG was designed so that the environment hosted two DAG copies of the databases on a local storage array and a single remote copy on a remote storage array for disaster recovery. LoadGen tests were executed in this environment to establish the Exchange 2010 DAG baseline performance. To simulate a remote disaster recovery site, a WAN emulation tool was placed between the Exchange mailbox servers in the primary and disaster recovery sites, and the impact of WAN latency was analyzed by varying the following (a back-of-the-envelope throughput bound for such a link is sketched after this section):

- WAN latency, set to 30 ms during user load
- WAN bandwidth, set to T3 during user load

The performance of the Exchange DAG with WAN latency was analyzed by running the LoadGen tests for 10 hours. Statistics were collected after the tests completed.

Testing methodology 4

In this scenario, Replication Manager was used to take point-in-time snapshots of the Exchange 2010 DAG passive copies in the local site, and it verified the application consistency of each snapshot. The snapshots were mounted on the Exchange mailbox server that hosted the passive copies, and the mount point was used for backing up the Exchange data. Windows native backup was used to take a file-based backup from the mount points. The performance of the Exchange DAG was analyzed by running the LoadGen tests for 24 hours, with the backup operations performed while the DAG environment was under Exchange load. Statistics were collected after the tests completed to study the impact of the snapshot.
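To give a feel for why the 30 ms delay in testing methodology 3 matters to DAG log shipping, the sketch below applies the standard bandwidth-delay product: a single TCP stream's throughput is bounded by its window size divided by the round-trip time. The T3 bandwidth and 30 ms latency match the tested scenario; the window sizes are generic illustrations, not measurements from this solution.

```python
# Upper bound on single-stream TCP throughput over the emulated WAN:
# throughput <= window_size / round_trip_time (bandwidth-delay product).
# T3 bandwidth and 30 ms latency match the tested scenario; the window
# sizes below are generic illustrations.

T3_MBPS = 45.0   # T3 link capacity in Mb/s
RTT_S = 0.030    # 30 ms round-trip time

for window_kb in (64, 256, 1024):
    mbps = (window_kb * 1024 * 8) / RTT_S / 1e6   # window bits per RTT
    capped = min(mbps, T3_MBPS)                   # the link itself also limits
    print(f"{window_kb:>5} KB window: {mbps:7.1f} Mb/s latency ceiling "
          f"-> {capped:4.1f} Mb/s effective on a T3 link")
```

With a classic 64 KB window, the 30 ms round trip alone caps a single stream at roughly 17 Mb/s, well below the T3 line rate, which is one plausible mechanism behind the log copy latency increases reported later in this chapter.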

Result analysis

Overview

Introduction

The building block configuration was determined by configuring the underlying storage subsystem into various RAID configurations: RAID 5, RAID 6, and RAID 10. The subsequent sections provide the test results that were produced while investigating the building blocks.

Comparison between RAID 5, RAID 6, and RAID 10

Introduction

This test compared RAID 5, RAID 6, and RAID 10 configurations to determine the RAID group configuration best suited for the DAG testing and scale-up. The Microsoft Exchange Jetstress tool was used for the testing, and each test ran for two hours.

RAID 5 was tested with spindle counts of five and seven, using the RAID 5 (4+1) and RAID 5 (6+1) configurations, respectively. For RAID 6, a 4+2 configuration was chosen, and for RAID 10, a 4+4 configuration. For all configurations, the databases and logs were created on the same set of spindles but on alternate LUNs. All configurations had two databases and logs, except for the 6+1 RAID 5 configuration, which had three databases and logs because of its additional capacity.

Baseline results

The following diagram shows a comparison of the performance of the various building blocks that were tested with Microsoft Jetstress.

The test results show that RAID 10 (4+4) produced the maximum IOPS, but used more disks. Compared with RAID 6 (4+2), RAID 5 (4+1) performed better in terms of heavy-profile user count. The following diagram shows a comparison of the Exchange database IOPS per disk and the Background Database Maintenance (BDM) IOPS per disk for all the RAID configurations.

The following diagram shows a comparison of the performance of the various building blocks from the storage side, including the disk throughput and the disk response time reported by CLARiiON. These figures include the RAID penalty, database maintenance, and log writes. The disk response time is highest in the RAID 5 (6+1) configuration, while the disk throughput (IOPS) is highest in the RAID 5 (4+1) configuration.
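The throughput differences between these configurations track the standard RAID write penalties. The following is a rough model layered on top of the document's comparison (an illustrative assumption, not the test method): host-visible IOPS for each tested group are estimated from an assumed per-disk IOPS capability, an assumed read fraction, and the penalty per write.

```python
# Rough host-visible IOPS model for the tested RAID groups, using the
# standard write penalties (RAID 5: 4, RAID 6: 6, RAID 1/0: 2).
# Per-disk IOPS and read fraction are illustrative assumptions, not
# values measured in this testing.

DISK_IOPS = 80       # assumed 7.2k rpm SATA capability
READ_FRACTION = 0.6  # assumed Exchange read share

configs = {"RAID 5 (4+1)": 5, "RAID 5 (6+1)": 7, "RAID 6 (4+2)": 6, "RAID 1/0 (4+4)": 8}
penalty = {"RAID 5 (4+1)": 4, "RAID 5 (6+1)": 4, "RAID 6 (4+2)": 6, "RAID 1/0 (4+4)": 2}

for name, disks in configs.items():
    backend_budget = disks * DISK_IOPS
    # Solve host_iops * (reads + writes * penalty) = back-end budget.
    host_iops = backend_budget / (READ_FRACTION + (1 - READ_FRACTION) * penalty[name])
    print(f"{name}: ~{host_iops:.0f} host IOPS from {disks} disks "
          f"({host_iops / disks:.0f} per spindle)")
```

Even this crude model reproduces the shape of the measured results: RAID 1/0 yields the most host IOPS but spends the most disks, while RAID 5 (4+1) delivers the best IOPS per spindle among the parity configurations.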

The RAID 5 (4+1) configuration performs better than RAID 6 (4+2) and RAID 5 (6+1). Considering its moderate performance and high storage capacity, the RAID 5 (4+1) configuration is best suited for this Exchange environment, and it was chosen for further testing because it delivered optimal performance with the minimum number of spindles across all the configurations.

Scale up test results with RAID 5 (4+1) configuration

Scale-up tests were performed to determine whether the system scales up to yield a linear increase in performance for a larger environment.

Comparison of the building blocks

The following diagrams show a comparison between one, three, six, and nine building blocks. The test results show that the user count increased almost linearly.

Network MB total/s

The following diagram shows the total network MB/s for the server NICs for each of the building block counts.

Conclusion

The network data transfer rate increased steadily as the number of building blocks increased.

SP utilization

The following diagram shows the average SP utilization for each of the building block counts.

Conclusion

SP utilization also increased steadily as the number of building blocks increased, but stayed below 30 percent.

Resiliency scenarios

Introduction

Resiliency is an important aspect of a customer environment, and it is crucial to understand how the loaded system behaves during recoverable failure conditions. This section discusses the different resiliency scenarios (SP reboot, network link failure, and RAID group rebuild) and studies how Exchange performance varies when a recoverable failure occurs.

SP reboot scenario

Introduction

A LoadGen test was run for around 10 hours and the SP was rebooted in the middle of the test run. This test shows the impact of the trespass of the database LUNs subsequent to the reboot.

Test result analysis

The SP reboot took three minutes to complete. The LUNs owned by SP A were trespassed back within five minutes of the start of the SP reboot.

Average database latency and Exchange IOPS

The following diagram shows the impact of the SP reboot on the read and write latencies of the Exchange database.

NOTE: The Exchange IOPS is high because multiple LoadGen clients were used for the test. This was done to understand the impact when the storage system is under stress.

Observations

The diagram shows the impact on the Exchange database read and write latencies and the IOPS 30 minutes before and after the SP reboot. The latencies were affected only slightly during the SP reboot and settled back to normal values after the reboot. The impact on the IOPS was also minor.

Conclusion

The SP reboot had a minor impact on the Exchange database latency and IOPS.

RPC averaged latency

The following diagram shows the RPC averaged latency.

Conclusion

There was a minor impact on the Exchange RPC averaged latency immediately after the reboot, caused by the trespassing of the database LUNs, but the latency returned to its normal value soon after.

Network link failure scenario

Introduction

A LoadGen test was run for around 10 hours and a network link failure was simulated in the fifth hour of the test run. For this purpose, one of the iSCSI connections in the storage network connecting the host and the back end was temporarily disabled, and it was restored after two hours. Since there were multiple iSCSI connections between the storage and the server, a large impact was not expected.

Test result analysis

This section explains the impact of the network link failure on the performance of Exchange Server one hour before and after the event.

Average database latency and Exchange IOPS

The following diagram shows the Exchange database average read and write latencies before and after the network link failure.

Observations

The test results show that the network link failure did not have much impact on the database read and write latencies. The IOPS decreased slightly after the link failure.

Conclusion

Comparing the values one hour before and after the network link failure shows that the read latency increased by 0.6 ms and the write latency by 0.35 ms because of the failure. The test results show that this resiliency scenario did not cause a sustained impact on the latencies, and the impact on the IOPS was minimal.

RPC averaged latency

The following diagram shows the RPC averaged latency.

Conclusion

There was no impact on the Exchange RPC averaged latency due to the network link failure scenario. The latency values remained almost constant after the link failure was simulated.

RAID group rebuild scenario

Introduction

The RAID group rebuild resiliency scenario was simulated to determine the impact on Exchange performance when a disk that stores the databases is pulled out. This is an important test case because it closely mirrors a real-world scenario in which a disk goes faulty in a production environment; the subsequent replacement of the disk triggers a rebuild operation that may affect performance. The following test case analyzes the performance of Exchange during this scenario.

A LoadGen test was run in forever mode. During the test, a RAID group rebuild scenario was simulated: one of the SATA disks was pulled out four hours after the start of the test. This disk was part of a RAID group that stored the database and log LUNs. The disk was placed back after four hours, and the subsequent rebuild of the disk took 18 hours and 30 minutes.

Test result analysis

This section explains the impact of the disk pullout on the performance of Exchange Server.

Average database latency and Exchange IOPS

The following diagram shows the Exchange database average read and write latencies.

Observations

The diagram shows the average latency variations an hour before and after the resiliency scenario. The scenario caused a minor increase of 0.1 percent in the read latency and 0.14 percent in the write latency. The IOPS were marginally affected by the resiliency event.

Conclusion

The resiliency scenario did not have a significant impact on the average database latencies and IOPS.

RPC averaged latency

The following diagram shows the RPC averaged latency.

Observations

The diagram shows the Exchange RPC averaged latency before and after the resiliency scenario. After an initial minor spike, the RPC averaged latency returned to its normal value.

Conclusion

The RPC averaged latency exhibited normal behavior and was not impacted by the resiliency scenario.

Exchange 2010 DAG test

Introduction

This section explains the results of the LoadGen tests run on the Exchange 2010 DAG environment for the following scenarios:

- Exchange 2010 DAG with 1,500 users with a 100 sent/received user profile.
- Exchange 2010 DAG with Exchange continuous replication over WAN for 1,500 users.
- Exchange 2010 DAG with a CLARiiON SnapView snapshot taken from the local passive copy. The snapshot was then mounted on the passive mailbox server in the DAG to take a backup using the Windows native backup method. The LoadGen test was run during the snapshot and backup operations to analyze the performance impact on Exchange Server.

For all these test scenarios, the Exchange databases were stored on iSCSI LUNs created using a RAID 5 (4+1) configuration of SATA disks. A similar configuration was used for the local and remote copies of the DAG. A heavy user load was simulated on the active databases to determine the database I/O performance.

Database read/write latency and IOPS

The following diagram shows the performance results of the Exchange Server 2010 DAG baseline tests on LAN and WAN, and of the Exchange 2010 DAG test with the CLARiiON snapshot and Windows native backup. The diagram compares the Exchange Server baseline performance with the WAN test and with the snapshot and backup test.

Conclusion

The diagram shows that the performance of Exchange Server was not greatly affected by the WAN delay in the DAG environment. Compared to the baseline test, there was a 21 percent decrease in IOPS in the backup test, a very small increase in the database read latency, and an increase of around 50 percent in the write latency. This indicates that Exchange Server performance was affected by the snapshot and backup test, but the latencies stayed below the gating metrics.

Exchange RPC performance

The RPC performance counters show how Exchange Server processes RPC operations from different client protocols:

- RPC operations/s indicates the current number of RPC operations occurring per second.
- RPC Requests indicates the overall RPC requests currently being executed within the Exchange information store process. This must be below 70 at all times; if it increases steadily, it may indicate a bottleneck on the mailbox server.
- RPC averaged latency indicates the RPC latency (in ms) averaged over the operations in the last 1,024 packets. This should not be higher than 10 ms on average.

The following diagram shows the RPC operations/s, RPC Requests, and RPC averaged latency on the Exchange mailbox server for the three scenarios.

Conclusion

The client RPC operations were not affected by the WAN delay. However, there was an approximately 20 percent decrease in the RPC operations because of the snapshot and backup operations. The RPC averaged latency increased by around 90 percent for the snapshot and backup test, but it remained well below the threshold value of 10 ms.

Exchange average log copy latency

This counter shows the time that the Exchange replication process takes to copy the logs to the passive database location. This is important because if the log copy is delayed by the snapshot and backup operations, the replication process could fall behind. The following diagram shows the average log copy latency of Exchange Server on the passive mailbox database for the baseline, DAG on WAN, and snapshot and backup tests.
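A minimal sketch of how this counter can be evaluated, assuming per-log copy latency samples and using the 30-second-per-log replay criterion cited in the conclusion that follows. The database names and latency values are illustrative only:

```python
# Average log copy latency check for a DAG passive copy.
# Criterion from the text: latency should stay well under the
# minimum log replay time of 30 seconds per log per database.
REPLAY_CRITERION_S = 30.0

# Hypothetical samples: database name -> per-log copy latencies (seconds)
log_copy_latency = {
    "DB01": [0.8, 1.1, 0.9, 2.4],  # illustrative values only
    "DB02": [1.0, 0.7, 6.5, 1.2],
}

for db, samples in sorted(log_copy_latency.items()):
    avg = sum(samples) / len(samples)
    worst = max(samples)
    status = "OK" if worst < REPLAY_CRITERION_S else "AT RISK"
    print(f"{db}: avg={avg:.1f}s worst={worst:.1f}s "
          f"(criterion {REPLAY_CRITERION_S:.0f}s/log) -> {status}")
```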

Conclusion

The diagram shows that the snapshot and backup operations had no major impact on the log copy to the passive remote mailbox server; the latency showed only a few spikes during those operations. The log copy latency increased significantly when the WAN delay was introduced, but it was still acceptable because the latency remained much lower than the minimum log replay time of 30 seconds per log per database.

Result analysis for the backup test with Exchange

Introduction

In this configuration, the Exchange databases were stored on an iSCSI LUN that was created using one RAID 5 (4+1) configuration of SATA disks. A similar configuration was used for the local and remote copies of the DAG. The reserve LUN pool for the CLARiiON snapshot was created on FC disks and sized based on a 20 percent change rate calculation. The backup destination was created on SATA disks with a RAID 5 (6+1) configuration. A heavy Exchange user load was simulated on the active databases to determine the database I/O performance.

Exchange RPC operations

The following diagram shows the RPC operations performed by Exchange Server during the baseline run for the following scenarios: before the snapshot operation started, during the snapshot operation, and during the backup operation.
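The 20 percent change rate figure translates directly into a reserve LUN pool size. The sketch below shows the arithmetic under common sizing assumptions; the safety-margin parameter and the 500 GB example LUN are hypothetical, not values from the original configuration:

```python
# Reserve LUN pool sizing for a CLARiiON SnapView snapshot session.
# SnapView copies original chunks on first write, so the pool must hold
# the data expected to change while the snapshot session is active.

def reserve_pool_gb(source_lun_gb, change_rate=0.20, safety_margin=1.0):
    """Estimated reserve pool capacity for one source LUN.

    change_rate: fraction of the source expected to change (20% here).
    safety_margin: extra headroom multiplier (hypothetical; 1.0 = none).
    """
    return source_lun_gb * change_rate * safety_margin

# Illustrative example: a 500 GB database LUN at a 20 percent change
# rate needs roughly 100 GB of reserve pool capacity.
print(f"{reserve_pool_gb(500):.0f} GB")  # -> 100 GB
```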

Conclusion

The snapshot and backup operations did not significantly affect the performance of Exchange Server when processing RPC operations. However, compared with the baseline Exchange RPC performance results, there was a reduction in the RPC operations during the snapshot and backup tests.

RPC averaged latency

The following diagram shows the RPC averaged latency during the snapshot and backup test when the snapshot was taken from the passive database.

Conclusion

The diagram shows that the snapshot and backup operations had almost no impact on RPC request processing, because the latency did not spike during those operations. The RPC averaged latency also remained well below its threshold value of 10 ms throughout the test.
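"No spikes" in conclusions like this one can be made precise with a simple rule, for example flagging any sample that exceeds a multiple of the run's median. A minimal sketch; the 3x multiplier and the sample values are assumptions, not criteria from the original tests:

```python
import statistics

def find_spikes(latency_ms, multiplier=3.0):
    """Return (index, value) pairs where a sample exceeds
    multiplier * median of the whole series.

    multiplier=3.0 is an assumed rule of thumb, not a value
    taken from the original test criteria.
    """
    baseline = statistics.median(latency_ms)
    return [(i, v) for i, v in enumerate(latency_ms)
            if v > multiplier * baseline]

# Illustrative RPC averaged latency samples (ms), one per minute
samples = [3.2, 3.5, 3.1, 3.4, 12.8, 3.3, 3.6]  # one obvious spike
print(find_spikes(samples))  # -> [(4, 12.8)]
```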

Exchange average I/O log read latency

This counter indicates the average time (in ms) required to read data from a log file. It is specific to log replay and database recovery operations. When the log read latency exceeds the gating metric, the database copy may lag because logs are not replayed to the passive database copy fast enough, and log replication performance may also be affected. The average value should be below 200 ms.

The following diagram compares the log read latency on the passive database for the baseline and backup tests. The passive database was used to take the snapshot and backup.

Conclusion

The snapshot duration shown in the diagram includes both the snapshot operation and the Exchange database consistency check, because Replication Manager runs both operations in the same job. The snapshot operation took around 15 to 20 minutes; Replication Manager then mounted the database and started the database consistency check. During the snapshot operation, and more specifically during the database consistency check, the log read latency for the replay activity was elevated and remained high until the check completed. This was caused by the additional I/O activity that the consistency check placed on the database disks. The latency returned to normal after the consistency check was completed. The latency was also high during the backup operation, though not as high as during the snapshot operation, and it matched the baseline figure for the rest of the test cycle. Although the log read latency was affected during the snapshot and backup operations, it stayed well below the threshold value.
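A per-phase comparison like the one in this diagram can be reproduced from a counter export by averaging within each test phase and checking the 200 ms gating value. A minimal sketch; the phase names, boundaries, and sample values are hypothetical (in the real test the phase boundaries would come from the Replication Manager job timing):

```python
# Per-phase average log read latency vs. the 200 ms gating value.
GATING_MS = 200.0

# Hypothetical (phase -> samples in ms) data for the backup test run.
phases = {
    "pre-snapshot":     [12.0, 14.5, 11.8],
    "snapshot + check": [95.0, 140.0, 155.0, 120.0],  # consistency-check I/O
    "backup":           [60.0, 75.0, 58.0],
    "post-backup":      [13.0, 12.2, 12.9],
}

for phase, samples in phases.items():
    avg = sum(samples) / len(samples)
    verdict = "OK" if avg < GATING_MS else "EXCEEDS GATING"
    print(f"{phase:16s} avg={avg:6.1f} ms -> {verdict}")
```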

Exchange database I/O read and write latency comparison between active and passive databases with snapshot and backup

This section compares the database read and write latencies during the snapshot and backup operations on the active and passive databases. Both tests were performed in a similar manner. For the active database test, the CLARiiON snapshot was taken on the active database serving the MAPI clients and was mounted on the active mailbox server to take the backup from the mount points while the LoadGen test with a heavy profile ran continuously. Similarly, for the passive database test, the passive databases were used to take the snapshot.

The following diagrams show the database I/O read latency and write latency comparisons on the active databases for the two use cases. The first diagram shows the Exchange database I/O read and write latency pattern for the passive database test; the second shows the pattern for the active database test.

Conclusion

Both diagrams show that the read and write latencies on the active database were severely affected while the snapshot operation was running, particularly during the Exchange database consistency check, as mentioned earlier. When the snapshot and backup tests were performed on the active database, the average read latency increased by almost 30 percent and nearly reached the threshold value of 20 ms. The passive database was also affected by the snapshot operation, but not to the extent of the active database, and the latency was also high during the backup operation.

After the snapshot was taken, the major impact of the database consistency check fell on the write latency when the check ran against the active database. For the passive database test, the active database write latency was not affected by the snapshot and backup operations. For the active database test, the write latency increased by more than 200 percent and the average write latency exceeded the threshold value of 20 ms. During the snapshot operation, the write latency was continuously high, between 40 ms and 80 ms, because the database consistency check kept the active database disks busy.
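To make the percentages concrete, a short worked sketch of the threshold arithmetic follows. The baseline latencies are illustrative values chosen to match the narrative above, not measurements from these tests:

```python
# Worked example of the latency-increase arithmetic used above.
THRESHOLD_MS = 20.0  # database read/write latency gating value

def after_increase(baseline_ms, pct_increase):
    """Latency after a percentage increase, e.g. +200% => 3x baseline."""
    return baseline_ms * (1 + pct_increase / 100.0)

# Illustrative baselines (not measured values from this solution):
read_baseline, write_baseline = 15.0, 8.0

read_after = after_increase(read_baseline, 30)     # ~30% read increase
write_after = after_increase(write_baseline, 200)  # >200% write increase

print(f"read:  {read_baseline} -> {read_after} ms "
      f"({'below' if read_after < THRESHOLD_MS else 'above'} {THRESHOLD_MS} ms)")
print(f"write: {write_baseline} -> {write_after} ms "
      f"({'below' if write_after < THRESHOLD_MS else 'above'} {THRESHOLD_MS} ms)")
```

With these illustrative numbers, a 30 percent read increase lands at 19.5 ms (just under the 20 ms threshold, matching "almost reached the threshold value"), while a 200 percent write increase lands at 24 ms, above it.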