Performance Monitoring Always On Availability Groups. Anthony E. Nocentino

Anthony E. Nocentino
- Consultant and Trainer
- Founder and President of Centino Systems
- Specialize in system architecture and performance
- Computer Science M.S., B.S.
- Microsoft MVP - Data Platform - 2017
- Friend of Redgate - 2015-2017
- Email: aen@centinosystems.com
- Twitter: @nocentino
- Blog: www.centinosystems.com/blog
- Pluralsight Author: www.pluralsight.com

Overview
- Motivation
- How Availability Groups move data
- Impact of replication latency on availability
- Monitoring techniques
- Demo
- Dealing with replication latency

Why is this important?
- Recovery Objectives
  - Recovery Point Objective (RPO)
  - Recovery Time Objective (RTO)
- Availability
  - How much data can we lose?
  - How fast will the system fail over?
- Monitoring and Trending
  - Establish a baseline for analysis: are we meeting those objectives?
  - Impact on resources
- Ownership
  - All of the components are monitored by the DBA

Data Movement in Availability Groups
- Transaction log blocks are replicated to secondaries
- Replication mode
  - Synchronous
  - Asynchronous
- Database mirroring endpoint

Network Based Replication
- Strong working relationship with network team
  - Maintenance - patching, network outages, database
- Network conditions can impact your AG's availability
  - Latency - how long it takes for a packet of data to traverse the network from source to destination
  - Bandwidth - how much data can be moved in a time interval

Network Latency
- Often measured in milliseconds, sometimes microseconds
- Directly impacts network throughput
  - TCP Congestion Window
- ping isn't your best measure of latency; by default it doesn't include any load. Measure your workload.
- It's often up to us to PROVE to the network team there is an issue
    Pinging 192.168.2.1 with 32 bytes of data:
    Reply from 192.168.2.1: bytes=32 time=1001ms TTL=128
- In synchronous mode, you have to wait for network latency
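The link between latency and throughput can be sketched with the common window-per-round-trip bound: a single TCP connection can have at most one window of unacknowledged data in flight per round trip. This is a rough illustration, not a measurement technique. The function name and the 64 KiB window are assumptions chosen for the example; real windows vary with window scaling and congestion control.

```python
def max_tcp_throughput_bytes_per_sec(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one TCP connection's throughput: one window per round trip."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return window_bytes / (rtt_ms / 1000.0)


# Assumed window size, for illustration only.
WINDOW_BYTES = 64 * 1024

# Round-trip times in milliseconds; 1001 ms matches the ping reply shown above.
for rtt_ms in (1, 10, 100, 1001):
    bound = max_tcp_throughput_bytes_per_sec(WINDOW_BYTES, rtt_ms)
    print(f"RTT {rtt_ms:>4} ms -> at most {bound / 1024:,.1f} KiB/sec")
```

With the 1001 ms round trip from the ping output, the bound for this assumed window is just under 64 KiB/sec, which is why a synchronous commit that waits on such a link feels slow.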

Availability Group Flow Control
- Used in response to network and system conditions
- Log blocks exchange sequence numbers
- The AG will enter flow control mode if:
  - The primary detects too many unacknowledged messages; the primary then stops sending messages
  - The secondary needs the primary to back off, likely due to resource constraints; it sends a flow control message to the primary
- Primary polls every 1000ms for a change in flow control state
- Secondary will message primary to leave flow control mode
- From: SQL Server PFE Blog - http://bit.ly/1zpgyil

Database Synchronization States
- Not synchronizing
- Synchronized
- Synchronizing
- Reverting
- Initializing
https://msdn.microsoft.com/en-us/library/ff877972.aspx

Failover Modes
- Automatic
  - Synchronous mode only
  - Synchronization state must be SYNCHRONIZED
  - Commonly used within a data center
- Manual
  - Synchronous or Asynchronous
  - Commonly used between data centers
https://msdn.microsoft.com/en-us/library/hh213151.aspx

Send Queue
- Queues log blocks to be sent to the secondaries
- Each replica maintains its own view of the send queue
- Queued data is at risk of loss in the event of a primary failure
- The send queue can grow due to an unreachable secondary, a network outage, network latency, or a large amount of data change

Redo Queue
- Queues log blocks received on the secondary
- Each replica has its own redo queue
- On failover, the redo queue must be completely processed
- The redo queue can grow due to a slow disk subsystem, resource contention, or a sustained outage and subsequent reconnection of a secondary

Send Queue Impact on Availability
- When log is generated on the primary faster than it can be sent to the secondaries
- No automatic failover
- Data loss
- Stale data for reporting from secondaries
- Stale data for off-loaded backups on secondaries
- Off-loaded log backups can fail
- Transaction delay
- Fill up transaction logs
- Even in synchronous mode!

Redo Queue Impact on Availability
- When log blocks are received on the secondary faster than the redo thread can process them
- Delayed failover
  - Detect failure
  - Process redo queue
  - Crash recover database
- Stale data for reporting from secondaries
- Stale data for off-loaded backups on secondaries
- Off-loaded log backups can fail
- Transaction delay
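The three failover phases listed above suggest a rough way to reason about failover time: add the detection time, the time to drain the redo queue (queue size divided by redo rate), and the crash recovery time. The sketch below is only a back-of-the-envelope estimate under that assumption. The function name and every input value are hypothetical, and real redo rates fluctuate with workload.

```python
def estimated_failover_seconds(
    detect_sec: float,
    redo_queue_kb: float,
    redo_rate_kb_per_sec: float,
    crash_recovery_sec: float,
) -> float:
    """Sum the three phases; redo time is approximated as queue size / redo rate."""
    if redo_queue_kb > 0 and redo_rate_kb_per_sec <= 0:
        # A non-empty queue that is not draining never finishes.
        return float("inf")
    redo_sec = redo_queue_kb / redo_rate_kb_per_sec if redo_queue_kb > 0 else 0.0
    return detect_sec + redo_sec + crash_recovery_sec


# Hypothetical inputs: 10 s to detect the failure, 9,000,000 KB queued for redo,
# redo running at 45,000 KB/sec, and 5 s of crash recovery.
print(estimated_failover_seconds(10, 9_000_000, 45_000, 5))  # 215.0
```

Trending this estimate over time, rather than reading it once, is what shows whether a recovery time objective is realistically being met.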

Log Backups

Transaction Delay
- In synchronous mode, when secondaries are behind, queries on the primary can be delayed
  - HADR_SYNC_COMMIT
  - HADR_SYNCHRONIZING_THROTTLE - replica back online

Disconnected Replica
- When a synchronous secondary exceeds its session time-out, it changes to asynchronous commit mode
- When the secondary comes back online, it will attempt to re-sync and resume synchronous mode
  - HADR_SYNCHRONIZING_THROTTLE - time to go from SYNCHRONIZING to SYNCHRONIZED
  - WRITELOG - while replica is down

Maintenance Events That Can Impact Availability
- Bulk data modifications
- Database maintenance
- Network or server maintenance
- Carefully plan maintenance
- Collaborate with other teams!

Monitoring AG Performance
- Dynamic Management Views
  - sys.dm_hadr_database_replica_states
- Perfmon Counters
  - SQL Server:Availability Replica
    - Replication data - messages sent, bytes sent, flow control
  - SQL Server:Database Replica
    - Database data - log bytes sent, queue sizes, transaction delay per database

Measuring Replication Latency
- sys.dm_hadr_database_replica_states
  - log_send_queue_size
  - log_send_rate
  - redo_queue_size
  - redo_rate
- On the primary there's a row for each database on each replica
- On the secondaries there's a row for each database on that replica
- Replicas track their own values
- When a replica goes offline, log_send_queue_size changes to NULL
- The log send queue is from primary to secondary
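As a small illustration of how rows from this DMV might be interpreted once fetched, the sketch below formats one row and treats a NULL log_send_queue_size (None in Python) as a possibly offline replica, as noted above. The queue sizes are reported in KB. The function name and the "replica" and "database" keys are hypothetical stand-ins for names you would join in yourself, and the sample rows are made up.

```python
def describe_replica_state(row: dict) -> str:
    """Summarize one row of queue sizes; both sizes are in KB."""
    label = f"{row['replica']} / {row['database']}"
    send_kb = row["log_send_queue_size"]
    redo_kb = row["redo_queue_size"]
    if send_kb is None:
        # NULL send queue: the replica may be offline, so do not treat it as zero.
        return f"{label}: log_send_queue_size is NULL, replica may be offline"
    redo_text = "unknown" if redo_kb is None else f"{redo_kb} KB"
    return f"{label}: {send_kb} KB waiting to be sent, {redo_text} waiting for redo"


# Made-up sample rows, shaped like a query result taken on the primary.
rows = [
    {"replica": "SQL2", "database": "Sales",
     "log_send_queue_size": 1024, "redo_queue_size": 2048},
    {"replica": "SQL3", "database": "Sales",
     "log_send_queue_size": None, "redo_queue_size": None},
]

for row in rows:
    print(describe_replica_state(row))
```

Handling NULL explicitly matters because a monitoring job that coerces it to zero would report an unreachable replica as fully caught up.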

Measuring Replication Latency - ugh!!!
- Well, it looks like sys.dm_hadr_database_replica_states doesn't report the correct values for log_send_rate and redo_rate
  - Last know sent average throughput???? What is that?
- Reported on Connect: https://connect.microsoft.com/sqlserver/feedback/details/928582
- Known bug in SQL Server 2012 or 2014: https://support.microsoft.com/en-us/kb/3012182
  - Cumulative Update 5 or better
- Observed in SQL 2016 and 2017
- Perfmon!

Measuring Latency with Perfmon
- Primary
  - SQLServer:Databases - Log Bytes Flushed/sec
  - SQLServer:Availability Replica - Bytes Sent to Replica/sec (compressed)
  - Network Interface - Bytes Sent/sec
- Secondaries
  - SQLServer:Availability Replica - Bytes Received From Replica/sec (compressed)
  - SQLServer:Database Replica - Log Bytes Received/sec (log send rate/uncompressed)
  - SQLServer:Database Replica - Redone Bytes/sec (log redo rate)
  - Network Interface - Bytes Received/sec
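One way to use these counters together is to compare log generated on the primary (Log Bytes Flushed/sec) with log arriving at a secondary (Log Bytes Received/sec), both uncompressed, over the same sample intervals. The sketch below is an approximation that assumes evenly spaced samples already collected from both servers. The function name and the sample values are hypothetical.

```python
def backlog_growth_bytes(flushed_per_sec, received_per_sec, interval_sec):
    """Approximate change in unsent log over the sampled period, in bytes.

    A positive result means the secondary fell further behind; a negative
    result means it caught up.
    """
    if len(flushed_per_sec) != len(received_per_sec):
        raise ValueError("both counters need the same number of samples")
    return sum(
        (flushed - received) * interval_sec
        for flushed, received in zip(flushed_per_sec, received_per_sec)
    )


# Made-up samples taken every 15 seconds (bytes/sec).
flushed = [50_000_000, 80_000_000, 20_000_000]   # primary: Log Bytes Flushed/sec
received = [40_000_000, 40_000_000, 40_000_000]  # secondary: Log Bytes Received/sec

print(backlog_growth_bytes(flushed, received, 15))  # 450000000
```

In this made-up run the secondary ends the period about 450 MB further behind, even though it was catching up during the last interval.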

Wait stats - sync vs. async

Synchronous - HADR_SYNC_COMMIT
    sqldk.dll!xesospkg::wait_info::publish+0x138
    sqldk.dll!sos_scheduler::updatewaittimestats+0x2bc
    sqldk.dll!sos_task::postwait+0x9e
    sqlmin.dll!eventinternal<suspendqueueslock>::wait+0x1fb
    sqlmin.dll!sequencedobject<logblockid,sequencedwaitinfo<logblockid>0>::waituntilsequenceadvances+0x160
    sqlmin.dll!hadrcommitmgr::hardennotifyinternal+0x1af
    sqlmin.dll!hadrcommitmgr::hardennotify+0xac
    sqlmin.dll!recoveryunit::notifyhardenparticipants+0x1e2
    sqlmin.dll!recoveryunit::hardenlog+0x217
    sqlmin.dll!xdesrmfull::commitinternal+0x6b8
    sqlmin.dll!xactrm::singlephasecommit+0x1a1
    sqlmin.dll!xactrm::commitinternal+0x472

Asynchronous - WRITELOG
    sqldk.dll!xesospkg::wait_info::publish+0x138
    sqldk.dll!sos_scheduler::updatewaittimestats+0x2bc
    sqldk.dll!sos_task::postwait+0x9e
    sqlmin.dll!sqlserverlogmgr::waitlcflush+0x219
    sqlmin.dll!sqlserverlogmgr::logflush+0x29e
    sqlmin.dll!sqlserverlogmgr::waitlogwritten+0x17
    sqlmin.dll!recoveryunit::hardenlog+0x25e
    sqlmin.dll!xdesrmfull::commitinternal+0x6b8
    sqlmin.dll!xactrm::singlephasecommit+0x1a1
    sqlmin.dll!xactrm::commitinternal+0x472

Measuring Latency with Extended Events
- In SQL Server 2014 SP2 / 2016 SP1, each event has a measured duration
- Primary
  - hadr_log_block_group_commit
  - log_block_pushed_to_logpool
  - log_flush_start
  - hadr_log_block_compression
  - hadr_capture_log_block
  - ucs_connection_send_msg
  - hadr_log_block_send_complete
  - log_flush_complete
  - hadr_receive_harden_lsn_message
  - hadr_db_commit_mgr_harden
- Synchronous Secondary
  - hadr_transport_receive_log_block_message
  - hadr_log_block_decompression
  - hadr_apply_log_block
  - log_block_pushed_to_logpool
  - log_flush_start
  - log_flush_complete
  - hadr_send_harden_lsn_message
  - ucs_connection_send_msg
  - hadr_lsn_send_complete

Monitoring Tools
- Build your own
- AlwaysOn Dashboard
- Third Party Tool
  - SentryOne SQL Sentry for SQL Server
  - Redgate SQL Monitor

Demo

Real World Example 248GB 109GB

Dealing With Replication Latency
- Identify your bottleneck and mitigate it
- Minimize log generation
  - Use smart index maintenance/better indexes
- More bandwidth
  - Perhaps a dedicated network connection
- Better hardware
  - Log throughput on secondaries needs to be equal to primary
- Upgrade SQL Server
  - 2012/2014 single threaded redo - ~45MB/sec
  - 2016 multi-threaded redo - ~600MB/sec

Key Takeaways
- It is imperative to track and trend replication latency in your Availability Groups so you can answer the questions:
  - How much data will I lose?
  - How long will it take to fail over?
- Monitor and trend log_send_queue_size and redo_queue_size in sys.dm_hadr_database_replica_states on replicas to measure availability impact
- Understand how much log is generated in your databases
- Understand your system's operations; consider downtime for patching and network maintenance

Key Takeaways
- Plan database maintenance
  - Use a smart index maintenance strategy!
- Offloaded backups
  - If availability is most important, back up on the primary

Need more data or help?
- http://www.centinosystems.com/blog/talks/
  - Links to resources
  - Demos
  - Presentation
- aen@centinosystems.com
- @nocentino
- www.centinosystems.com
- Solving tough business challenges with technical innovation

Questions?

References
- http://www.centinosystems.com/blog/sql/designing-for-offloadedbackups-in-alwayson-availability-groups/
- http://www.centinosystems.com/blog/sql/designing-for-offloadedlog-backups-in-alwayson-availability-groups-monitoring/
- http://www.centinosystems.com/blog/sql/monitoring-availabilitygroups-with-redgates-sql-monitor
- https://msdn.microsoft.com/en-us/library/ff878537.aspx
- https://msdn.microsoft.com/en-us/library/ff877972.aspx