The Day the DNS Died

Similar documents
Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

How to Install Forcepoint NGFW in Amazon AWS TECHNICAL DOCUMENT

PrepAwayExam. High-efficient Exam Materials are the best high pass-rate Exam Dumps

Update on experimental BIND features to rate-limit recursive queries

Is Your Project in Trouble on System Performance?

BARCELONA. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

August 14th, 2018 PRESENTED BY:

Lessons Learned Operating Active/Active Data Centers Ethan Banks, CCIE

How Netflix Leverages Multiple Regions to Increase Availability: Isthmus and Active-Active Case Study

[MS10987A]: Performance Tuning and Optimizing SQL Databases

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

Logging, Monitoring, and Alerting

amazon.com s Journey to the Cloud Jon Jenkins AWS Summit June 13, 2011

Migrating Existing Applications to AWS. Matt Tavis Principal Solutions Architect

Hackproof Your Cloud Responding to 2016 Threats

A10 HARMONY CONTROLLER

DNS Configuration Guide. Open Telekom Cloud

CDN TUNING FOR OTT - WHY DOESN T IT ALREADY DO THAT? CDN Tuning for OTT - Why Doesn t It Already Do That?

How to Configure Route 53 for F-Series Firewalls in AWS

Security: Michael South Americas Regional Leader, Public Sector Security & Compliance Business Acceleration

Copyright 2016 Pivotal. All rights reserved. Cloud Native Design. Includes 12 Factor Apps

FRNOG 25 Meeting: BIND9 Recursive Client Rate limiting

CIT 668: System Architecture. Amazon Web Services

NGF0502 AWS Student Slides

SQL Server Administration 10987: Performance Tuning and Optimizing SQL Databases. Upcoming Dates. Course Description.

AWS_SOA-C00 Exam. Volume: 758 Questions

Actian PSQL Vx Server Licensing

Diagnosing the cause of poor application performance

AWS Well Architected Framework

ARCHITECTING WEB APPLICATIONS FOR THE CLOUD: DESIGN PRINCIPLES AND PRACTICAL GUIDANCE FOR AWS

WhatsConfigured v3.1 User Guide

Additional Security Services on AWS

Aurora, RDS, or On-Prem, Which is right for you

AWS Administration. Suggested Pre-requisites Basic IT Knowledge

Service Mesh and Microservices Networking

Pervasive PSQL Vx Server Licensing

AWS Solution Architecture Patterns

Understanding Perimeter Security

Going Serverless. Building Production Applications Without Managing Infrastructure

Hosting Roadmap Upgrades, Improvements and Changes

Course Outline. Lesson 2, Azure Portals, describes the two current portals that are available for managing Azure subscriptions and services.

SmartDNS. Speed: Through load balancing, FatPipe's SmartDNS speeds up the delivery of inbound traffic.

Creating Your Virtual Data Center

ControlUp v7.1 Release Notes

Course Outline. Performance Tuning and Optimizing SQL Databases Course 10987B: 4 days Instructor Led

A Cloud Gateway - A Large Scale Company s First Line of Defense. Mikey Cohen Manager - Edge Gateway Netflix

WHITE PAPER: BEST PRACTICES. Sizing and Scalability Recommendations for Symantec Endpoint Protection. Symantec Enterprise Security Solutions Group

SECURITY ON AWS 8/3/17. AWS Security Standards MORE. By Max Ellsberry

Amazon AWS-Solutions-Architect-Professional Exam

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Test Methodology We conducted tests by adding load and measuring the performance of the environment components:

Chapter 6: Distributed Systems: The Web. Fall 2012 Sini Ruohomaa Slides joint work with Jussi Kangasharju et al.

Starting the Avalanche:

Configuring the Oracle Network Environment. Copyright 2009, Oracle. All rights reserved.

Title: Planning AWS Platform Security Assessment?

MD-100: Modern Desktop Administrator Part 1

Performance Tuning & Optimizing SQL Databases Microsoft Official Curriculum (MOC 10987)

<Insert Picture Here> Oracle Application Cache Solution: Coherence

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Exploring Cloud Security, Operational Visibility & Elastic Datacenters. Kiran Mohandas Consulting Engineer

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION

Azure Webinar. Resilient Solutions March Sander van den Hoven Principal Technical Evangelist Microsoft

COURSE 10982: SUPPORTING AND TROUBLESHOOTING WINDOWS 10

4 Effective Tools for Docker Monitoring. By Ranvijay Jamwal

Research Faculty Summit Systems Fueling future disruptions

Using DNS Service for Amplification Attack

Building a Modular and Scalable Virtual Network Architecture with Amazon VPC

Course 10982B: Supporting and Troubleshooting Windows 10

Coherence An Introduction. Shaun Smith Principal Product Manager

Overview. Audience Profile. At Course Completion. Module Title : 10982B: Supporting and Troubleshooting Windows 10. Course Outline :: 10982B::

Low Latency Data Grids in Finance

Test - Accredited Configuration Engineer (ACE) Exam - PAN-OS 6.0 Version

MongoDB in AWS (MongoDB as a DBaaS)

Supporting and Troubleshooting Windows 10

Course Outline. Developing Microsoft Azure Solutions Course 20532C: 4 days Instructor Led

A SKY Computers White Paper

Oracle Database 11g: Real Application Testing & Manageability Overview

How can you implement this through a script that a scheduling daemon runs daily on the application servers?

Deep Dive Amazon Kinesis. Ian Meyers, Principal Solution Architect - Amazon Web Services

Developing Microsoft Azure Solutions: Course Agenda

OnCommand Cloud Manager 3.2 Deploying and Managing ONTAP Cloud Systems

AWS FREQUENTLY ASKED QUESTIONS (FAQ)

Using SQL Server on Amazon Web Services

CogniFit Technical Security Details

AWS: Basic Architecture Session SUNEY SHARMA Solutions Architect: AWS

CACHE ME IF YOU CAN! GETTING STARTED WITH AMAZON ELASTICACHE. AWS Charlotte Meetup / Charlotte Cloud Computing Meetup Bilal Soylu October 2013

Build planetary scale applications with compartmentalization

Amazon Search Services. Christoph Schmitter

Re-engineering the DNS One Resolver at a Time. Paul Wilson Director General APNIC channeling Geoff Huston Chief Scientist

Configuring Basic Interface Parameters

INCIDENTRESPONSE.COM. Automate Response. Did you know? Your playbook overview - Data Theft

Dell EMC Isolated Recovery

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures

Administration 1. DLM Administration. Date of Publish:

Lenovo ThinkSystem NE Release Notes. For Lenovo Cloud Network Operating System 10.6

Amazon AWS-Solution-Architect-Associate Exam

Getting Started Guide. VMware NSX Cloud services

ECE 650 Systems Programming & Engineering. Spring 2018

Using WireShark to support the Application June 16, 2011

Developing Microsoft Azure Solutions (MS 20532)

Transcription:

The Day the DNS Died Jeremy Blosser, Principal Operations Engineer jblosser@sparkpost.com https://tinyurl.com/spdnstalk 1

Introduction SparkPost, aka Message Systems, is a high-volume, transactional email software and services vendor. 2

Introduction SparkPost, aka Message Systems, is a high-volume, transactional email software and services vendor. We send a lot of email: Over 30% of the world s non-spam email is sent using our software. 15B messages/month sent via our cloud offering. 3

Introduction SparkPost, aka Message Systems, is a high-volume, transactional email software and services vendor. We send a lot of email: Over 30% of the world s non-spam email is sent using our software. 15B messages/month sent via our cloud offering. That requires a lot of DNS: 8,000 queries/second. 20Mb/s+ sustained traffic just for DNS queries. Several different resolution paths. 4

Introduction But DNS is easy, right? 5

Introduction But DNS is easy, right? 6

Outline Introduction Previous DNS Design(s) May 2017 Outage New DNS Design Lessons Learned / Remembered References Questions? 7

Previous DNS Design(s) 8

Version 1, Centralized Internal Resolver Cluster 9

Version 1, Centralized Internal Resolver Cluster 10

Version 1, Centralized Internal Resolver Cluster 11

Version 1, Centralized Internal Resolver Cluster 12

Version 1, Centralized Internal Resolver Cluster 13

Version 1, Centralized Internal Resolver Cluster 14

Version 1, Centralized Internal Resolver Cluster 15

Version 1, Centralized Internal Resolver Cluster 16

Version 1, Centralized Internal Resolver Cluster 17

Version 1, Centralized Internal Resolver Cluster 18

Version 1, Centralized Internal Resolver Cluster 19

Version 1, Centralized Internal Resolver Cluster 20

Version 2, AWS VPC Resolver 21

Version 2, AWS VPC Resolver 22

Version 1.5, Centralized Internal Resolver Cluster 23

Version 1.5, Centralized Internal Resolver Cluster 24

Version 3.14, Centralized Internal Resolver Cluster 25

May 2017 26

May 2017 Outage A day like any other day until... 27

May 2017 Outage A day like any other day until... 28

May 2017 Outage A day like any other day until... 29

May 2017 Outage A day like any other day until... 30

May 2017 Outage A day like any other day until... 31

May 2017 Outage A day like any other day until... 32

May 2017 Outage DNS Cluster Aggregate CPU 33

May 2017 Outage MTA Cluster Aggregate CPU 34

May 2017 Outage MTA Cluster Aggregate CPU 35

May 2017 Outage Mail Delivery (one customer) 36

May 2017 Outage (Near) Total Impact Sending mail 37

May 2017 Outage (Near) Total Impact Sending mail - (most) customer mail injection not impacted 38

May 2017 Outage (Near) Total Impact Sending mail - (most) customer mail injection not impacted App/DB traffic 39

May 2017 Outage (Near) Total Impact Sending mail - (most) customer mail injection not impacted App/DB traffic Metrics 40

May 2017 Outage (Near) Total Impact Sending mail - (most) customer mail injection not impacted App/DB traffic Metrics Config management (partial) 41

May 2017 Outage (Near) Total Impact Sending mail - (most) customer mail injection not impacted App/DB traffic Metrics Config management (partial) Admin logins 42

May 2017 Outage (Near) Total Impact Sending mail - (most) customer mail injection not impacted App/DB traffic Metrics Config management (partial) Admin logins 43

May 2017 Outage 44

May 2017 Outage Diagnosing Blind 45

May 2017 Outage Diagnosing Blind Lack of insight into our DNS Unable to reach support systems 46

May 2017 Outage Diagnosing Blind Lack of insight into our DNS Unable to reach support systems Is it throttling (again)? Is it capacity (again)? 47

May 2017 Outage Diagnosing Blind Lack of insight into our DNS Unable to reach support systems Is it throttling (again)? - Central forward to VPC Resolver Immediately overrun Is it capacity (again)? - Add capacity Immediately affected 48

May 2017 Outage Mitigation Repoint individual instances to VPC Resolver - Edit resolv.conf 49

May 2017 Outage resolv.conf Limited to 3 entries Always tried top to bottom Limited practical retry Read on app startup - Changes require restarts 50

May 2017 Outage Mitigation Repoint individual instances to VPC Resolver - Edit resolv.conf, with restarts - Provided breathing room Main resolver cluster recovered as load was removed App tier recovery: 2 hours Major customer mail recovery: 4-5 hours Time to full recovery: 7 hours 51

May 2017 Outage Mitigation Webhook SQS Queued Messages 52

May 2017 Outage Diagnosis Asymmetric DNS packet flow - Tcpdump - AWS Network Flow Logs Average 300 responses per 5000 queries (94% failure) 53

May 2017 Outage The Cause? 54

May 2017 Outage The Cause? Connection Tracking 55

May 2017 Outage The Cause? [Undocumented] Connection Tracking 56

May 2017 Outage The Cause? [Undocumented] Connection Tracking 57

May 2017 Outage After Action Conclusions Incident response process was functional Ability to respond via the process was compromised Limits of iteration New DNS design required 58

New DNS Design 59

New DNS Design Requirements Resolve all needed name sources Modifiable without changing resolv.conf Avoid throttling No conntrack Multi cluster / isolate components Distributed across resolver clusters Minimize latency Effective caching Respect TTLs Increase DNS profiling and monitoring 60

New DNS Design 61

New DNS Design Network Configuration Dedicated VPC for isolation Open Security Groups with stateless ACLs Separate resolver clusters to isolate impacts Query traffic favors same Availability Zone 62

New DNS Design 63

New DNS Design Resolver (Unbound) Configuration Instance and service tuning Multiple network interfaces per instance Multiple IPs per interface serve-expired enabled 64

New DNS Design OS Configuration Two local cache services 127.0.0.1 routes to resolvers in same AZ 127.0.0.2 routes to resolvers in other AZs dnsmasq Configuration Max concurrency Max cache size /etc/resolv.conf points to: 127.0.0.1 127.0.0.2 direct resolver IP 65

New DNS Design 66

Lessons Learned / Remembered AWS main service model is pull, not push 67

Lessons Learned / Remembered AWS main service model is pull, not push Not all cloud provider limits are apparent - make sure they understand your business 68

Lessons Learned / Remembered AWS main service model is pull, not push Not all cloud provider limits are apparent - make sure they understand your business Instrument your support services - and protect them from each other 69

Lessons Learned / Remembered AWS main service model is pull, not push Not all cloud provider limits are apparent - make sure they understand your business Instrument your support services - and protect them from each other resolv.conf is not agile - not even eventually consistent 70

Lessons Learned / Remembered AWS main service model is pull, not push Not all cloud provider limits are apparent - make sure they understand your business Instrument your support services - and protect them from each other resolv.conf is not agile - not even eventually consistent Iteration doesn t solve it all 71

Lessons Learned / Remembered It s always a DNS problem 72

Lessons Learned / Remembered It s always a DNS problem - unless it s a firewall problem 73

References https://d1.awsstatic.com/whitepapers/hybrid-cloud-dns-options-for-v pc.pdf https://docs.aws.amazon.com/awsec2/latest/userguide/using-net work-security.html#security-group-connection-tracking http://unbound.net/ https://docs.aws.amazon.com/amazonvpc/latest/userguide/vpc-d ns.html http://www.thekelleys.org.uk/dnsmasq/doc.html 74

Questions? jblosser@sparkpost.com 75