Background
PatientA got in touch because they were having performance pain with $VENDOR's applications. PatientA wasn't sure if the problem was hardware, their configuration, or something in $VENDOR's code. $VENDOR wasn't sure either, but they were pretty sure it wasn't their code. PatientA also wanted help with maintenance and best practices. Check out what happened next!
Patient A SQL Critical Care Part 1: Health Triage Findings
AKA: Are we going to lose data?
© 2016 Brent Ozar Unlimited. All Rights Reserved. For details: http://www.brentozar.com/go/samples
We're Brent Ozar Unlimited. RICHIE RUMP ANGELA WALKER DOUG LANE TARA KIZER BRENT OZAR JESSICA CONNORS ERIK DARLING
What instance did we look at?
Instance: [REDACTED]
Applications involved: [REDACTED]
Memory size: 144GB
Number of logical cores: 8 vCPUs
SQL Server version and edition: SQL Server 2012 SP1 (with the fix for the initial SP1 release), Enterprise Edition
Virtualized? Yes. [REDACTED DETAILS]
RPO and RTO
Recovery Point Objective (RPO): How many minutes of data can you lose in a worst-case scenario?
Recovery Time Objective (RTO): How many minutes can you be offline in a worst-case scenario?

                               Server offline   Corrupt data   Oops deletes   Datacenter offline
Minutes of data loss allowed   4 hours          4 hours        4 hours        4 hours
Minutes of downtime allowed    4 hours          4 hours        4 hours        4 hours
This presentation focuses on avoiding data loss
We're not solving all your pain points here. We're talking about:
1. Making sure you don't lose more data than is acceptable, even if multiple things go wrong
2. Detecting corruption if it happens
3. Getting data back to a point in time if it's incorrect or corrupt
This can be done with pretty simple steps; there are just a few of them.
Your Steps to Prevent Data Loss
Take regular log backups
Determine their frequency this way: if we lost the SAN, how much data would we be willing to lose? For RedactedImportantDB, this means backing up the transaction log every 5 minutes, or every 1 minute (just answer the question above).
http://www.brentozar.com/archive/2014/02/backtransaction-logs-every-minute-yes-really/
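Here's a minimal sketch of what each scheduled log backup step runs. The UNC share path and file name are hypothetical placeholders; in practice a scheduled job (such as Ola Hallengren's) generates timestamped file names for you:

    -- Minimal sketch: one transaction log backup, run every 1-5 minutes by a SQL Agent job.
    -- CHECKSUM makes SQL Server verify page checksums as it writes the backup.
    BACKUP LOG [RedactedImportantDB]
    TO DISK = N'\\BackupShare\RedactedImportantDB\RedactedImportantDB_20160601_120500.trn'
    WITH COMPRESSION, CHECKSUM;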
Regular log backups can help avoid this
Every time these happen, everyone doing a write has to wait a very long time! The problem is several things combined:
- The log file was shrunk
- Log backups aren't running frequently enough to let SQL Server reuse space in the file
- Log growths are set to a percentage, not a fixed size
REDACTED: screenshot showing 2-3 minute log growths
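As a sketch of the fixed-growth fix, assuming a logical log file name that's a hypothetical placeholder (check sys.database_files for the real one):

    -- Switch log autogrowth from a percentage to a fixed size.
    -- 512MB is an assumption; pick a fixed size appropriate to your write volume.
    ALTER DATABASE [RedactedImportantDB]
    MODIFY FILE (NAME = N'RedactedImportantDB_log', FILEGROWTH = 512MB);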
Regular log backups are good for performance, too!
Your existing maintenance did one log backup each day, at night: one log backup, then shrink the log. All throughout the day, the log file periodically had to grow, and this led to big delays. You're much better off doing frequent log backups and never shrinking the log file!
Don't back up to local storage
You're currently taking full backups to a drive on the production VM, using the same SAN. This is risky: what if the SAN failed? What if the VM couldn't start up? How long would it take to get to the files?
A better option: run all backups to a UNC path for a file share on separate storage.
Speeding up backups: short term
There are a few tricks to this. For large databases, writing the backup out to multiple files can improve throughput. With Ola Hallengren's backup job, the @NumberOfFiles parameter does this easily: try @NumberOfFiles = 4 first on your RedactedImportantDB database (a sketch follows below). Test restoring the database too, so you know how the commands work. Third-party tools like $SQLVENDORTOOL1 make tuning backup speed, file counts, and restores easier.
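A minimal sketch of that job step, assuming Ola Hallengren's DatabaseBackup procedure is installed; the share path is a hypothetical placeholder:

    EXECUTE dbo.DatabaseBackup
        @Databases = 'RedactedImportantDB',
        @Directory = N'\\BackupShare\SQLBackups',  -- hypothetical UNC path on separate storage
        @BackupType = 'FULL',
        @Compress = 'Y',
        @CheckSum = 'Y',
        @NumberOfFiles = 4;                        -- split the backup across 4 files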
Test your backups: can you meet your RTO?
Test restoring full and log backups to a nonproduction server. Can you restore many log files quickly? Can you restore to a single point in time? Find a PowerShell script or tool (but beware: scripts that require xp_cmdshell shouldn't be needed for this task). Document and/or automate the process. Could someone else do this quickly if you're not available? Can you complete it within your RTO of 4 hours?
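A minimal point-in-time restore sketch; the file paths, logical file names, and STOPAT time are all hypothetical placeholders:

    -- Restore the full backup first, leaving the database ready for log restores.
    RESTORE DATABASE [RedactedImportantDB]
    FROM DISK = N'\\BackupShare\RedactedImportantDB\full.bak'
    WITH NORECOVERY,
         MOVE N'RedactedImportantDB'     TO N'D:\Data\RedactedImportantDB.mdf',
         MOVE N'RedactedImportantDB_log' TO N'L:\Logs\RedactedImportantDB_log.ldf';

    -- Restore each log backup in sequence WITH NORECOVERY, then apply the one
    -- covering your target time WITH STOPAT and RECOVERY:
    RESTORE LOG [RedactedImportantDB]
    FROM DISK = N'\\BackupShare\RedactedImportantDB\log_1205.trn'
    WITH STOPAT = '2016-06-01 12:03:00', RECOVERY;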
Manage backup history and messages
Set up a job to purge backup history from MSDB once a week using sp_delete_backuphistory. Ola Hallengren's maintenance includes a job for this (Ola.Hallengren.com); you provide the @OldestDate parameter.
Implement trace flag 3226 so that successful backups don't get written to the SQL Server error log. Failures will still be written, and all backups will still be recorded in MSDB.
http://msdn.microsoft.com/en-us/library/ms188396.aspx
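A sketch of both steps; the 30-day retention cutoff is an assumption to adjust:

    -- Purge MSDB backup history older than 30 days:
    DECLARE @OldestDate datetime = DATEADD(DAY, -30, GETDATE());
    EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = @OldestDate;

    -- Enable trace flag 3226 for the running instance (add -T3226 as a
    -- startup parameter to make it survive restarts):
    DBCC TRACEON (3226, -1);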
Speeding up backups: long term
$SANVENDOR$ $SANTOOL$ can help speed up backups. This does require more licensing at the SAN level, and it may require reconfiguring drive assignments for data files. But it can also replicate the backups to other locations, which is useful for DR.
Danger: currently your SAN snapshots do not talk to the VSS provider, so your databases may be corrupt and unusable upon restore!
Create and test SQL Agent alerts
Create alerts for high-severity errors and corruption. Set the alerts to notify you through an operator and Database Mail. Script to create them: http://brentozar.com/go/alert
To test that it all works, run:
RAISERROR('TESTALERT', 18, 1) WITH LOG;
Make sure you get the email!
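As a sketch of what one of those alerts looks like (the alert and operator names are hypothetical placeholders; the script linked above creates the full set for the high severities and for errors 823, 824, and 825):

    EXEC msdb.dbo.sp_add_alert
        @name = N'Severity 019',
        @severity = 19,
        @delay_between_responses = 60,        -- seconds between repeat notifications
        @include_event_description_in = 1;    -- 1 = include the error text in the email
    EXEC msdb.dbo.sp_add_notification
        @alert_name = N'Severity 019',
        @operator_name = N'DBA Team',         -- hypothetical operator name
        @notification_method = 1;             -- 1 = email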
Set Agent jobs to notify on failure
Lots of jobs currently won't let an operator know if they fail (too many to list here; run our sp_Blitz script anytime to get a full list). Set the jobs to notify your operator when they fail; at minimum, do this for all database and log backup jobs.
To test: set one of the jobs to notify on completion and make sure you get the email. Then switch it back to notify on failure.
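A sketch of the change for one job; the job and operator names are hypothetical placeholders:

    EXEC msdb.dbo.sp_update_job
        @job_name = N'DatabaseBackup - USER_DATABASES - LOG',  -- hypothetical job name
        @notify_level_email = 2,                               -- 2 = notify on failure
        @notify_email_operator_name = N'DBA Team';             -- hypothetical operator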
Set a failsafe operator
Tell your SQL Server Agent who it can notify in case of an emergency. Set this in the SQL Agent properties.
Note: when you first set up the mail profile, you will need to restart the SQL Server Agent service.
You need CHECKDB, but you need to be careful
CHECKDB reads pages from disk and looks for corruption. It should be run against all system and user databases.
One place to be cautious: this is performance intensive, and your RedactedImportantDB database doesn't have a lot of free time in the nightly maintenance cycle.
Automate CHECKDB for all databases
Make sure you find out about corruption as soon as you can (see the sketch after this list):
1. Add CHECKDB for all databases except RedactedImportantDB on a nightly basis: exclude RedactedImportantDB from Ola Hallengren's job and schedule that job nightly
2. Add CHECKDB for the RedactedImportantDB database on a weekly basis: this requires a copy of the job which only checks that database. Set up the job on both replicas.
Make sure that this never overlaps with full backups or index maintenance! Watch the job closely on the first run and make sure it doesn't cause a failover or those 15-second I/O warnings in the SQL Server log.
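A sketch of the two job steps, assuming Ola Hallengren's DatabaseIntegrityCheck procedure:

    -- Nightly job: every database except RedactedImportantDB
    -- (the minus prefix excludes a database from the list):
    EXECUTE dbo.DatabaseIntegrityCheck
        @Databases = 'ALL_DATABASES, -RedactedImportantDB',
        @CheckCommands = 'CHECKDB';

    -- Weekly job: RedactedImportantDB only, scheduled clear of backups and index maintenance:
    EXECUTE dbo.DatabaseIntegrityCheck
        @Databases = 'RedactedImportantDB',
        @CheckCommands = 'CHECKDB';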
Don't delete backups until you run CHECKDB
Otherwise you can end up with data corruption in production and in all your backups. Rule of thumb: only delete a full backup after you get reasonable assurance that it isn't corrupt. You can do this by:
- Restoring the backup and running CHECKDB against the restored copy, or
- Running CHECKDB against the production database after the full backup completes
Make sure you keep the backups on separate storage, and that you're retaining enough copies to answer "what was the data like N days ago?" Some customers keep multiple copies of backup files if restoring to a historical point in time is important to their business.
Enable the Remote DAC
And practice using it! This is the Dedicated Admin Connection. One sysadmin can use it at a time, and SQL Server reserves a dedicated scheduler for it (even if you don't have remote access enabled). This comes in very handy in a performance crisis. Learn how to enable it and practice using it at BrentOzar.com/go/DAC
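A minimal sketch of enabling it and connecting; the server name is a hypothetical placeholder:

    -- Allow DAC connections from other machines (locally it works without this):
    EXEC sp_configure 'remote admin connections', 1;
    RECONFIGURE;

    -- Then connect via sqlcmd with the -A (admin) switch, or prefix the server
    -- name with ADMIN: in SSMS's connection dialog:
    -- sqlcmd -S YourServerName -A -d master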
Questions?
Now let's get you in to see one of our performance specialists. Not quite like this guy