Anomaly Detection Fault Tolerance Anticipation

1 Anomaly Detection, Fault Tolerance, Anticipation Patterns. John Allspaw, SVP, Tech Ops. QCon London 2012

2 Four Cornerstones (Erik Hollnagel): Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)

5 Anomaly Detection

6 Anomaly Detection: getting at the state of health; evaluating the state of health; components AND systems

7 Supervisory. Example: active health check. [Diagram: Monitor sends check_http to Component (webserver); the check returns exit OK]

8 Supervisory. [Diagram: Monitor sends check_http to Component (webserver); the check returns exit OK] Pros: easy to implement; easy to understand; well-known pattern. Cons: messaging can fail; scalability is limited.
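The active check on these slides can be sketched in Python. This is a minimal illustration, not the real check_http plugin: the function names, the injectable `probe` argument, and the default timeout and retry values are assumptions made for the example. Only the 0/1/2/3 exit-code convention comes from Nagios plugins.

```python
import urllib.error
import urllib.request

# Nagios-style plugin exit codes (a widely used convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def http_probe(url, timeout):
    """Fetch the URL and return its HTTP status code (raises on failure)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def active_check(url, timeout=1.0, retries=1, probe=http_probe):
    """Poll the component; report CRITICAL only after every attempt fails."""
    last_error = "no attempt made"
    for _attempt in range(retries + 1):
        try:
            status = probe(url, timeout)
        except OSError as err:  # URLError and socket timeouts are OSErrors
            last_error = str(err)
            continue
        if status < 400:
            return OK, f"HTTP OK: status {status}"
        last_error = f"status {status}"
    return CRITICAL, f"HTTP CRITICAL: {last_error}"
```

A wrapper script would print the message and call `sys.exit()` with the returned code, which is how a supervisor such as Nagios reads the result. The "messaging can fail" con shows up here as the `OSError` branch: the monitor cannot tell a dead webserver from a broken network path.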

9 Supervisor Sensitivity: 1 sec timeout, 1 retry, 3 sec interval. [Timeline: two failed checks (X) of 1s each, 3s apart, plus up to ~2.9s for the previous interval: 7.9 sec exposure]
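One way to read the 7.9 second figure is as a sum of the pieces on the timeline. The sketch below assumes that reading; the function name and the decomposition are not from the talk.

```python
def worst_case_exposure(interval, timeout, retries, previous_interval_gap):
    """Seconds of errors served before the monitor is sure of a fault.

    Assumed model (one reading of the slide, not stated in the talk):
    the fault starts just after a passing check, so most of one interval
    goes by unnoticed; the next check times out; each retry then waits a
    full interval and times out again.
    """
    first_failed_check = previous_interval_gap + timeout
    failed_retries = retries * (interval + timeout)
    return first_failed_check + failed_retries


exposure = worst_case_exposure(
    interval=3.0, timeout=1.0, retries=1, previous_interval_gap=2.9
)
print(f"{exposure:.1f} sec exposure")  # prints "7.9 sec exposure"
```

Under this model, each extra retry adds a full interval plus a timeout to the exposure, which is the cost of using retries to filter out transient failures.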

10 Supervisor Sensitivity. [Diagram: Monitor sends check_http to Component (webserver); the check returns exit OK. Schedule latency (max = N); request latency (max = 0.9s); response latency (max = 0.9s)]

11 Supervisor Sensitivity How many seconds of errors can you tolerate serving?

12 Supervisory. Example: passive health check. [Diagram: at each Interval, Component (webserver) reports to Monitor: exit 0, DISK consumption within bounds]

13 Supervisory. Example: passive health check. [Diagram: at each Interval, Component (webserver) reports to Monitor: exit 0, DISK consumption within bounds] Pros: efficient; scalability is different; fewer moving parts; less exposure; can submit to multiple places. Cons: nonideal for network-based services; different tuning (windowed expectation).

14 Supervisory. Example: passive health check. [Timeline diagram: TIME axis, Component, Interval, ??]

15 Supervisory. Example: passive health check. [Timeline diagram: TIME axis, Component, Interval, Schedule Latency, ??] Exposure = (Schedule + Interval) * UnknownConsecutiveIntervals + 1
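With a passive check, the monitor has to decide how long silence is acceptable (the "windowed expectation" tuning). A minimal sketch of that idea follows. The class name, the `missed_allowed` parameter, and the window formula are assumptions chosen for illustration; the formula is loosely modeled on the slide's exposure expression and is not a reconstruction of it.

```python
class PassiveCheck:
    """Tracks results that a component pushes to the monitor."""

    def __init__(self, interval, schedule_latency, missed_allowed=1):
        # Assumed tuning rule: tolerate this many seconds of silence.
        self.window = (interval + schedule_latency) * (missed_allowed + 1)
        self.last_seen = None
        self.last_exit = None

    def submit(self, exit_code, now):
        """Called when the component reports (for example, exit 0)."""
        self.last_seen = now
        self.last_exit = exit_code

    def state(self, now):
        """Evaluate on the monitor side; silence is itself a signal."""
        if self.last_seen is None or now - self.last_seen > self.window:
            return "STALE"
        return "OK" if self.last_exit == 0 else "CRITICAL"
```

For example, with `interval=60`, `schedule_latency=5`, and `missed_allowed=1`, a component that last reported at second 100 is considered stale after second 230. Timestamps are passed in explicitly so the behavior is easy to test.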

16 Frequency and Transience. Probability of false positives: short intervals, low # of retries, short timeouts. Probability of nondetection: long intervals, high # of retries, long timeouts.

17 In-Line. Example: passive application event logging. [Diagram: application, monitor]

18 Supervisory. Example: passive application event logging. [Diagram: application, monitor] Pros: on-demand publish. Cons: onus is on the app; can't be 100% sure it's working.

19 Supervisory. Example: passive application event logging. [Diagram: application, monitor] Positive events (sales, registrations, etc.). Negative events (errors, exceptions, etc.). Lack or presence of data mean different things, so history is paramount.

20 Context

21 Evaluation: what is abnormal?

22 [Plot: Response Time over Time]

23 Static Thresholds [Plot: Response Time over Time, with Warning and Critical threshold lines]
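A static threshold check compares each sample against fixed limits. The sketch below uses made-up response times and threshold values purely for illustration.

```python
def evaluate_static(value, warning, critical):
    """Classify one sample against fixed warning/critical limits."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Hypothetical response times in seconds; thresholds are illustrative.
samples = [0.8, 1.2, 2.5, 6.0, 1.1]
states = [evaluate_static(s, warning=2.0, critical=5.0) for s in samples]
print(states)  # ['OK', 'OK', 'WARNING', 'CRITICAL', 'OK']
```

The limits never move, so the same value is judged the same way at 3 a.m. and at peak traffic. That is the gap the Context slides go on to question.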

29 Context: Normal?

30 Context: 24 hours

31 Context: 7 days

32 Context: Normal But Noisy

33 Context: Smoothing?

34 Context: Holt-Winters Exponential Smoothing. Recent points influence a forecast, with exponentially decreasing influence backwards in time. en.wikipedia.org/wiki/exponential_smoothing

35 Context: "Aberrant Behavior Detection in Time Series for Network Monitoring" full_papers/brutlag/brutlag_html/
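The "exponentially decreasing influence" can be seen in simple (single) exponential smoothing, which is the building block that Holt-Winters extends with trend and seasonal terms. The sketch below covers only that building block; the function names are chosen for this example.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    level_t = alpha * x_t + (1 - alpha) * level_{t-1}
    """
    if not series:
        return []
    levels = [series[0]]
    for x in series[1:]:
        levels.append(alpha * x + (1 - alpha) * levels[-1])
    return levels


def observation_weights(n, alpha):
    """Weight given to each of the last n points, most recent first."""
    return [alpha * (1 - alpha) ** age for age in range(n)]


print(exponential_smoothing([10, 20], alpha=0.5))  # [10, 15.0]
print(observation_weights(3, alpha=0.5))           # [0.5, 0.25, 0.125]
```

Each step back in time multiplies a point's weight by `1 - alpha`, so a larger `alpha` reacts faster to recent data and forgets history sooner.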

36 Dynamic Thresholds

37 Dynamic Thresholds [Plot: raw data with upper bound and lower bound]

38 Dynamic Thresholds Hrm...

40 Dynamic Thresholds: Holt-Winters aberration. Ah!

41 Dynamic Thresholds: Graphite metrics collection w/ Holt-Winters aberrations; Nagios check for Graphite data
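A dynamic threshold derives its upper and lower bounds from the data itself. The sketch below is a deliberately simplified stand-in: it smooths a level and a deviation exponentially and flags points outside the band, with no trend or seasonal terms, so it is not full Holt-Winters and not Graphite's implementation. All parameter names and defaults are assumptions.

```python
def aberrations(series, alpha=0.3, gamma=0.3, width=3.0, warmup=5):
    """Return indexes of points outside a smoothed confidence band.

    Simplified sketch: no trend or seasonal terms, so this is not full
    Holt-Winters. The band is forecast +/- width * smoothed deviation.
    """
    flagged = []
    level = series[0]
    deviation = 0.0
    for i, x in enumerate(series[1:], start=1):
        error = abs(x - level)  # level is the forecast for this point
        if i >= warmup and error > width * deviation:
            flagged.append(i)
        # Update the deviation and the forecast with the new point.
        deviation = gamma * error + (1 - gamma) * deviation
        level = alpha * x + (1 - alpha) * level
    return flagged


# Hypothetical metric with one spike at index 8.
print(aberrations([10, 11, 10, 12, 10, 11, 10, 11, 50, 11, 10]))  # [8]
```

Note that the spike itself is folded into the level and deviation, which widens the band for the points that follow. Whether aberrant points should update the model is a design decision a production system would need to make.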

42 Four Cornerstones (Erik Hollnagel): Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)

43 FAULT TOLERANCE

44 Detection of fault X; triggers corrective action Y; clean up, report back. (RECOVERY OR MASKING)

45 Variation Tolerance

46 Adaptive Systems Expected Variation

49 [Diagram labels: Expected Variation; Control; Disturbance; compensation; New Disturbances Arise; Compensation is Exhausted; decompensation] (Woods, 2011)

51 [Diagram labels: Expected Variation; Control; Disturbance; compensation; New Disturbances Arise; Compensation is Exhausted; decompensation; Variation; Fault]

52 Variations != Faults

53 Dead, Corrupt, Late, Wrong

54 Fault Tolerance: Redundancy. Spatial (server, network, process); Temporal (checkpoint, rollback); Informational (data in N locations)

56 Spatial Redundancy [diagram]

57 Spatial Redundancy: Active/Active

58 Spatial Redundancy: Active/Passive

59 Spatial Redundancy: Roaming Spare; Dedicated Spare

60 In-Line Fault Tolerance. [Diagram: App (PHP, Thrift client) talks over Thrift to Search (Lucene/Solr)] Timeouts: connect, send, receive.

61 In-Line Fault Tolerance. [Diagram: App cannot reach a Search (Lucene/Solr) server (X)] 1. App attempts connection, can't. 2. Caches APC user object with 60s TTL, key=server:port. 3. Moves to next server in rotation, skipping any found in APC.
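The three steps above can be sketched as a small connection pool. This is a Python illustration of the pattern, not the PHP Thrift code the talk points to: the `ServerPool` class, its method names, and the in-process dictionary standing in for APC are all assumptions.

```python
import time


class ServerPool:
    """Rotate through servers, skipping ones that recently failed."""

    def __init__(self, servers, connect, failure_ttl=60.0, clock=time.monotonic):
        self.servers = list(servers)   # e.g. ["search1:8983", "search2:8983"]
        self.connect = connect         # callable(server), raises OSError on failure
        self.failure_ttl = failure_ttl
        self.clock = clock
        self.marked_down = {}          # "server:port" -> expiry (stands in for APC)
        self.next_index = 0

    def _is_marked_down(self, server):
        expiry = self.marked_down.get(server)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self.marked_down[server]  # TTL elapsed: eligible again
            return False
        return True

    def get_connection(self):
        """Try each server once, starting from the rotation pointer."""
        for offset in range(len(self.servers)):
            index = (self.next_index + offset) % len(self.servers)
            server = self.servers[index]
            if self._is_marked_down(server):
                continue
            try:
                connection = self.connect(server)
            except OSError:
                self.marked_down[server] = self.clock() + self.failure_ttl
                continue
            self.next_index = (index + 1) % len(self.servers)
            return connection
        raise ConnectionError("no servers available")
```

Because the failure marker expires on its own, a server that comes back is retried after the TTL without any operator action. The trade-off is that each application process pays for one failed connection attempt per TTL window.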

62 In-Line Fault Tolerance /lib/php/src/tsocketpool.php

63 In-Line Fault Tolerance. Pros: distributed checking and perspective; handles transient failures; auto-recovery. Cons: onus is on the app for implementation.

64 Fault Tolerance: Nagios Event Handlers. Attempt to recover from specific conditions. Chain together recovery actions. eventhandlers.html

65 If (fault X) then HUP process; re-check
    If (OK) then notify+exit
    ELSE Hard restart process; re-check
        If (OK) then notify+exit
        ELSE Remove from production; notify+exit
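The escalation chain above generalizes to a list of recovery actions, ordered from gentlest to most drastic. The sketch below is a generic Python rendering, not Nagios event-handler code; `recover` and its callable arguments are hypothetical names.

```python
def recover(check, actions, notify):
    """Run recovery actions in order, re-checking after each one.

    check:   callable() -> True when the service is healthy
    actions: list of (name, callable), gentlest first; must not be empty
    notify:  callable(message)
    """
    if check():
        return "ok"
    *attempts, (last_name, last_action) = actions
    for name, action in attempts:
        action()
        if check():
            notify(f"recovered after: {name}")
            return name
    # Last resort (e.g. remove from production): no re-check, just report.
    last_action()
    notify(f"gave up after: {last_name}")
    return last_name
```

A caller would pass something like `[("HUP", hup), ("hard restart", restart), ("remove from production", remove)]`. Every path ends in a notification, matching the notify+exit on each branch of the slide.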

66 How many seconds of errors can you tolerate serving?

67 Fail Closed: When a fault is found and can't be recovered or masked, operations cease to protect the rest of the system from damage.

68 Depth and Dependencies [Diagram: Monitor, Health check, Load Balancers, App, DB]

69 Depth and Dependencies. WARNING: Don't be too crazy. [Diagram: Monitor, Health check, Load Balancers, App, DB]

70 Fail Closed: Aggregate Cluster Checking. [Diagram: cluster with failed nodes marked X] If (clusterfail > 25%) then notify+exit ELSE OK
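The aggregate rule on this slide judges the cluster as a whole rather than each node. A minimal sketch follows, with a hypothetical function name and input shape. Only the 25% limit comes from the slide.

```python
def aggregate_cluster_check(node_results, max_failed_fraction=0.25):
    """Return (state, message) for a cluster given per-node health.

    node_results maps node name -> True (healthy) or False (failed).
    """
    if not node_results:
        return "UNKNOWN", "no nodes reported"
    failed = sorted(name for name, healthy in node_results.items() if not healthy)
    total = len(node_results)
    if len(failed) / total > max_failed_fraction:
        return "CRITICAL", f"{len(failed)}/{total} nodes failed: {', '.join(failed)}"
    return "OK", f"{len(failed)}/{total} nodes failed"
```

With eight nodes, two failures sit exactly at 25% and still report OK, while a third failure trips the check. The comparison is strictly greater-than, as written on the slide.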

71 Fail Open: When a fault happens and can't be masked or recovered, operations continue without the feature.

72 Fail Open, Example 1 at Etsy: Geo Targeting. 50ms internal SLA on guessing location via client IP. If >50ms, we just don't show local results.

73 Fail Open, Example 2 at Etsy: Rate Limiting. [Diagram: App, Memcache] Internal SLA on incrementing counters + checking totals. If >SLA, we let the action continue, and throw a fire-and-forget counter if we can.
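Both Etsy examples share one shape: attempt the optional feature under a time budget, and carry on without it if the budget is blown. The sketch below shows that shape in Python using a thread pool. The `fail_open` helper and the slow `guess_location` stand-in are hypothetical, and this is not how Etsy implemented it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def fail_open(func, sla_seconds, fallback, *args):
    """Run func(*args); if it errors or exceeds the SLA, return fallback."""
    future = _executor.submit(func, *args)
    try:
        return future.result(timeout=sla_seconds)
    except FutureTimeout:
        return fallback  # too slow: carry on without the feature
    except Exception:
        return fallback  # broken: carry on without the feature


def guess_location(client_ip):
    """Hypothetical slow geo lookup, standing in for a real service."""
    time.sleep(0.2)
    return "London"


# 50ms SLA as on the slide; None means "don't show local results".
location = fail_open(guess_location, 0.05, None, "203.0.113.7")
print(location)  # None
```

The slow call keeps running in its worker thread after the caller has moved on, so a real implementation would also want a timeout inside the lookup itself. For rate limiting, the fallback would be "allow the action".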

74 SYSTEMIC

75 [Diagram: App, Cache, DB, Search, Logging, Queue]

77 Functional Resonance

79 Shop Stats

80 Shop Stats [Diagram: App, Cache, DB, Search, Logging, Queue]

81 Registration [Diagram: App, Cache, DB, Search, Logging, Queue]

82 Registration

83 Shop Stats Logins Registrations Checkout New Listings Photos Search API Rate limiting Data Analysis Search A/B analysis Page performance Search Ads Editorial content systems Feedback Messaging/Convos Activity Feeds Circles Shipping Mobile Internationalization Testing Fraud

84 Systemic: Application/Functionality Health; Componential/Resource Health

85 Four Cornerstones (Erik Hollnagel): Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)

86 Anticipation: during design of architecture; during choice of technologies; during design of monitoring and metrics

87 TRADE-OFFS

88 What could possibly go wrong?

89 REQUISITE IMAGINATION

90 [Diagram labels: Possible Foreseeable Situations; Situations Considered By Novice Designer; Situations Considered By Average Designer; Situations Considered By Expert Designer] (Adamski and Westrum, 2003)

91 Anticipation: Failure Mode Effects Analysis (FMEA); Failure Mode Effects and Criticality Analysis (FMECA). Failure_mode,_effects,_and_criticality_analysis

92 Architectural reviews; Go or No-Go meetings; Game Day exercises

93 Anticipation: Servers, Networks, Software, Applications, Monitoring, Metrics, Traffic

94 PEOPLE

95 Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)

96 THE END


More information

jetnexus Virtual Load Balancer

jetnexus Virtual Load Balancer jetnexus Virtual Load Balancer Mitigate the Risk of Downtime and Optimise Application Delivery We were looking for a robust yet easy to use solution that would fit in with our virtualisation policy and

More information

Demystifying Storage Area Networks. Michael Wells Microsoft Application Solutions Specialist EMC Corporation

Demystifying Storage Area Networks. Michael Wells Microsoft Application Solutions Specialist EMC Corporation Demystifying Storage Area Networks Michael Wells Microsoft Application Solutions Specialist EMC Corporation About Me DBA for 7+ years Developer for 10+ years MCSE: Data Platform MCSE: SQL Server 2012 MCITP:

More information

jetnexus Virtual Load Balancer

jetnexus Virtual Load Balancer jetnexus Virtual Load Balancer Mitigate the Risk of Downtime and Optimise Application Delivery We were looking for a robust yet easy to use solution that would fit in with our virtualisation policy and

More information

Cloud Monitoring as a Service. Built On Machine Learning

Cloud Monitoring as a Service. Built On Machine Learning Cloud Monitoring as a Service Built On Machine Learning Table of Contents 1 2 3 4 5 6 7 8 9 10 Why Machine Learning Who Cares Four Dimensions to Cloud Monitoring Data Aggregation Anomaly Detection Algorithms

More information

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra Today CSCI 5105 Recovery CAP Theorem Instructor: Abhishek Chandra 2 Recovery Operations to be performed to move from an erroneous state to an error-free state Backward recovery: Go back to a previous correct

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 24 Mass Storage, HDFS/Hadoop Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ What 2

More information

Reminder: Mechanics of address translation. Paged virtual memory. Reminder: Page Table Entries (PTEs) Demand paging. Page faults

Reminder: Mechanics of address translation. Paged virtual memory. Reminder: Page Table Entries (PTEs) Demand paging. Page faults CSE 451: Operating Systems Autumn 2012 Module 12 Virtual Memory, Page Faults, Demand Paging, and Page Replacement Reminder: Mechanics of address translation virtual address virtual # offset table frame

More information

Activant Solutions Inc. SQL Server 2005: Data Storage

Activant Solutions Inc. SQL Server 2005: Data Storage Activant Solutions Inc. SQL Server 2005: Data Storage SQL Server 2005 suite Course 2 of 4 This class is designed for Beginner/Intermediate SQL Server 2005 System Administrators Objectives Understand how

More information

! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like

! Design constraints.  Component failures are the norm.  Files are huge by traditional standards. ! POSIX-like Cloud background Google File System! Warehouse scale systems " 10K-100K nodes " 50MW (1 MW = 1,000 houses) " Power efficient! Located near cheap power! Passive cooling! Power Usage Effectiveness = Total

More information

Fault-tolerant techniques

Fault-tolerant techniques What are the effects if the hardware or software is not fault-free in a real-time system? What causes component faults? Specification or design faults: Incomplete or erroneous models Lack of techniques

More information

Cluster-Based Scalable Network Services

Cluster-Based Scalable Network Services Cluster-Based Scalable Network Services Suhas Uppalapati INFT 803 Oct 05 1999 (Source : Fox, Gribble, Chawathe, and Brewer, SOSP, 1997) Requirements for SNS Incremental scalability and overflow growth

More information

vcloud Automation Center Reference Architecture vcloud Automation Center 5.2

vcloud Automation Center Reference Architecture vcloud Automation Center 5.2 vcloud Automation Center Reference Architecture vcloud Automation Center 5.2 This document supports the version of each product listed and supports all subsequent versions until the document is replaced

More information

Fault Tolerance. The Three universe model

Fault Tolerance. The Three universe model Fault Tolerance High performance systems must be fault-tolerant: they must be able to continue operating despite the failure of a limited subset of their hardware or software. They must also allow graceful

More information

Microsoft SQL Server Fix Pack 15. Reference IBM

Microsoft SQL Server Fix Pack 15. Reference IBM Microsoft SQL Server 6.3.1 Fix Pack 15 Reference IBM Microsoft SQL Server 6.3.1 Fix Pack 15 Reference IBM Note Before using this information and the product it supports, read the information in Notices

More information

The Google File System. Alexandru Costan

The Google File System. Alexandru Costan 1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems

More information

CS 856 Latency in Communication Systems

CS 856 Latency in Communication Systems CS 856 Latency in Communication Systems Winter 2010 Latency Challenges CS 856, Winter 2010, Latency Challenges 1 Overview Sources of Latency low-level mechanisms services Application Requirements Latency

More information

Release 3.0 Insights Storage Operation and Maintenance Guide. Revision: 1.0

Release 3.0 Insights Storage Operation and Maintenance Guide. Revision: 1.0 Release 3.0 Insights Storage Operation and Maintenance Guide Revision: 1.0 Insights Storage Operation and Maintenance Guide Portions of the documents can be copied and pasted to your electronic mail or

More information

Analysis and Optimization. Carl Waldspurger Irfan Ahmad CloudPhysics, Inc.

Analysis and Optimization. Carl Waldspurger Irfan Ahmad CloudPhysics, Inc. PRESENTATION Practical Online TITLE GOES Cache HERE Analysis and Optimization Carl Waldspurger Irfan Ahmad CloudPhysics, Inc. SNIA Legal Notice The material contained in this tutorial is copyrighted by

More information

Technical Guide Network Video Recorder Enterprise Edition Maintenance Guide

Technical Guide Network Video Recorder Enterprise Edition Maintenance Guide Technical Guide Network Video Recorder Enterprise Edition Maintenance Guide Network Video Management System March 23, 2018 NVMSTG008 Revision 1.3.0 CONTENTS 1. Overview... 3 1.1. About This Document...

More information

RELIABILITY & AVAILABILITY IN THE CLOUD

RELIABILITY & AVAILABILITY IN THE CLOUD RELIABILITY & AVAILABILITY IN THE CLOUD A TWILIO PERSPECTIVE twilio.com To the leaders and engineers at Twilio, the cloud represents the promise of reliable, scalable infrastructure at a price that directly

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance

More information

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following: CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online

More information

ExBC software C5.7.X. User manual. Part no Issue no 04 Date 04/2018

ExBC software C5.7.X. User manual. Part no Issue no 04 Date 04/2018 Part no 6159921030 Issue no 04 Date 04/2018 ExBC software C5.7.X User manual Model Part number EABC15-900 6151658410 EABC26-560 6151658420 EABC32-410 6151658430 EABC45-330 6151658440 EABC50-450 6151658450

More information

Yogesh Simmhan. escience Group Microsoft Research

Yogesh Simmhan. escience Group Microsoft Research External Research Yogesh Simmhan Group Microsoft Research Catharine van Ingen, Roger Barga, Microsoft Research Alex Szalay, Johns Hopkins University Jim Heasley, University of Hawaii Science is producing

More information