Juggling the Jigsaw: Towards Automated Problem Inference from Network Trouble Tickets


Rahul Potharaju (Purdue University), Navendu Jain (Microsoft Research), Cristina Nita-Rotaru (Purdue University)
April 3, 2013, NSDI 2013

Network Troubleshooting: The Big Picture
[Diagram: a link goes down between Switch A and Switch B in the datacenter; network monitoring raises an alert on the operator console, which triggers logging of a ticket.]
Network trouble tickets are diaries written by operators during network troubleshooting.

Goal: Automated Problem Inference from Trouble Tickets
Network trouble ticket -> inference output:
- Problems: What problems were observed? E.g., reboot loops, switch failure
- Activities: What troubleshooting was done? E.g., check config, verify BGP routes
- Actions: What was the resolution? E.g., replace line card, reboot
Key questions for network management:
[Q1] Why is network redundancy ineffective?
[Q2] What are the top-k failing components?
[Q3] Are new devices more reliable?

What Does a Ticket Contain?
- Structured fields, e.g., ticket title, problem type, priority
- Free-form text, e.g., operator notes, emails, device debug logs

Challenges in Analyzing Trouble Tickets
- Structured fields: coarse-grained information; inaccurate or incomplete in 69%-75% of the 10K+ tickets in our study
- Free-form text: written in natural language, with typos and ambiguity, grammatical errors, and domain-specific terms (e.g., DNS, DMZ, line card)

Our Contributions
Measurement study: 10K+ tickets logged from a large cloud provider (April 2010-12)
- Coarse-grained and inaccurate structured data in 69%-75% of the tickets
- Free-form natural language text comprising emails, IMs, device debug logs, etc.
NetSieve: combines NLP, knowledge discovery and ontology modeling in a novel way to infer
1. Problems: network entity and its state/condition, e.g., firewall failure, firmware error
2. Activities: steps performed during troubleshooting, e.g., change config, verify routes
3. Actions: resolution applied to mitigate the problem, e.g., replace disk, reboot switch
Achieves 83%-100% accuracy, evaluated using a domain expert, hardware vendor tickets and a survey of operators

Roadmap
- Motivation
- Strawman Approaches to Analyze Free-form Text
- NetSieve: Semantic-based Approach
- Evaluation
- Conclusion

Strawman Approaches to Analyze Free-form Text
Strawman #1: Use NLP techniques
- Limitation: they work only on well-written text such as news articles
Strawman #2: Keyword selection
- Limitation: ignores contextual semantics
Strawman #3: Clustering tickets based on manual keyword selection
- Limitations:
  1. Significant time and effort to build the keyword list
  2. The list has limited coverage or risks becoming outdated as the network evolves

NetSieve: Semantic-based Approach to Do Problem Inference
[Diagram: network trouble ticket -> NetSieve -> inference output (Problems, Activities, Actions)]

NetSieve Architecture: Knowledge Building Phase
Input: trouble ticket repository
1. Repeated Phrase Extraction. Goal: find frequently occurring phrases.
   Example output: "power supply unit is faulty", "access router inoperative", "run config script", "is to inform you that there"
2. Knowledge Discovery. Goal: find phrases important in the networking domain.
   Example output: <power supply unit is faulty>, <access router inoperative>, <run config script>
3. Ontology Modeling. Goal: semantic interpretation of the domain-specific phrases.
   Example output:
   ENTITY: power supply unit -> STATE: faulty
   ENTITY: access router -> CONDITION: inoperative
   ENTITY: config script -> ACTION: run

Step I: Repeated Phrase Extraction
Goal: find frequently occurring phrases
Challenges of extracting all possible n-grams:
- Computationally expensive
- Requires fine-tuning numerous thresholds
- Not all n-grams are useful (noise)
Approach: trade completeness for speed and scalability
1. Tokenize the tickets into sentences
2. Apply a data compression algorithm (LZW) to the input tickets
3. Take the dictionary built by LZW (a by-product of the encoder)
4. Compute the frequency of the phrases in the LZW dictionary using the Aho-Corasick algorithm
Output: repeated phrases and their frequencies
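The idea of reusing the LZW dictionary as a phrase list can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NetSieve implementation: it runs LZW over word tokens rather than bytes, the function names `lzw_phrases` and `count_phrases` are hypothetical, and the frequency count uses a naive sliding-window scan where the slides use the Aho-Corasick algorithm.

```python
from collections import Counter


def lzw_phrases(sentences):
    """Build a word-level LZW dictionary and return its multi-word entries."""
    dictionary = set()
    # Seed the dictionary with every single word (the LZW "alphabet").
    for sentence in sentences:
        dictionary.update((word,) for word in sentence.lower().split())
    # Standard LZW growth: extend the current match until it is unseen,
    # then record the new entry and restart from the last word.
    for sentence in sentences:
        current = ()
        for word in sentence.lower().split():
            candidate = current + (word,)
            if candidate in dictionary:
                current = candidate
            else:
                dictionary.add(candidate)
                current = (word,)
    return {phrase for phrase in dictionary if len(phrase) > 1}


def count_phrases(sentences, phrases):
    """Count phrase occurrences with a naive scan (stand-in for Aho-Corasick)."""
    counts = Counter()
    longest = max((len(p) for p in phrases), default=0)
    for sentence in sentences:
        words = tuple(sentence.lower().split())
        for start in range(len(words)):
            for size in range(2, longest + 1):
                gram = words[start:start + size]
                if len(gram) == size and gram in phrases:
                    counts[gram] += 1
    return counts


# Illustrative input only; these sentences are made up for the example.
tickets = [
    "power supply unit is faulty",
    "power supply unit is faulty",
    "replaced the power supply unit",
]
phrases = lzw_phrases(tickets)
counts = count_phrases(tickets, phrases)
```

Because LZW only adds an entry after it has already seen the prefix, longer phrases enter the dictionary only when they repeat, which is what makes the dictionary a cheap (if incomplete) source of repeated phrases.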

Step II: Knowledge Discovery
Goal: find phrases important in the current domain to do problem inference
Challenges:
- Filter meaningful phrases from noisy ones
- Expert labeling is time-consuming
Approach (19M phrases -> 5.6K phrases):
1. Apply a pipeline of linguistic filters
2. Rank phrases by importance using information-theoretic measures
Example phrases (important?):
- power disruption on access router
- key corruption due to expired certificate
- bad memory on server
- prior communication
- best regards
- informing you that
The first three describe networking problems; the last three are conversational filler.
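The slide does not say which information-theoretic measures are used for ranking. As one possible illustration, the sketch below scores adjacent word pairs by pointwise mutual information (PMI); the choice of PMI, the function name `pmi_scores` and the sample sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Score each adjacent word pair by pointwise mutual information."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (first, second), n in pair_counts.items():
        p_pair = n / total_pairs
        p_first = word_counts[first] / total_words
        p_second = word_counts[second] / total_words
        # PMI = log2( P(a, b) / (P(a) * P(b)) )
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Made-up corpus for illustration.
corpus = [
    "line card failed",
    "replaced line card",
    "the switch failed",
    "the router failed",
    "the line card",
]
scores = pmi_scores(corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus "line card" scores higher than "the line" because the two words co-occur more often than their individual frequencies predict. PMI is known to favor rare pairs, so a real pipeline would likely combine it with a minimum-frequency cutoff.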

Step III: Ontology Modeling
Goal: semantic interpretation of the extracted important phrases in the current domain
Challenge: how to precisely define the meaning of domain-specific phrases and the relationships between them?
Example: which ontology classes apply to the words in "and replace line card"?
Approach:
1. Define an ontology model
2. Tag phrases with classes from the ontology model

Step III: Ontology Modeling
Ontology classes and their semantic meaning:
- Entity: object that can be deployed or repaired, e.g., flash memory, core router
- Action: behavior that can be caused upon an entity, e.g., reboot, replace
- Condition: describes the state of an entity, e.g., bit error, hung state
Workflow: 5.6K phrases -> split phrases (domain expert) -> tag using ontology class -> 0.6K phrases
Example: "and replace line card" is split and tagged as and: OMIT, replace: Action, line card: Entity
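Once an expert has tagged the split phrases, applying the tags is essentially a dictionary lookup. The sketch below shows one way to do it with a greedy longest-match scan; the `ONTOLOGY` table holds only the examples from the slide, and `tag_phrase` and the `max_len` parameter are hypothetical names introduced for this illustration.

```python
# Hypothetical mini-ontology: class names follow the slide, and the entries
# are just the slide's examples, not the real 0.6K-phrase model.
ONTOLOGY = {
    "flash memory": "Entity",
    "core router": "Entity",
    "line card": "Entity",
    "reboot": "Action",
    "replace": "Action",
    "bit error": "Condition",
    "hung state": "Condition",
}


def tag_phrase(phrase, ontology=ONTOLOGY, max_len=3):
    """Tag a phrase with ontology classes, preferring the longest match."""
    words = phrase.lower().split()
    tagged = []
    i = 0
    while i < len(words):
        # Try the longest window first so "line card" beats "line".
        for size in range(min(max_len, len(words) - i), 0, -1):
            chunk = " ".join(words[i:i + size])
            if chunk in ontology:
                tagged.append((chunk, ontology[chunk]))
                i += size
                break
        else:
            # No ontology entry covers this word: mark it as omitted.
            tagged.append((words[i], "OMIT"))
            i += 1
    return tagged


tags = tag_phrase("and replace line card")
# tags == [("and", "OMIT"), ("replace", "Action"), ("line card", "Entity")]
```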

Putting it All Together (1/2): Tagging a Ticket
Steps: tokenize into sentences, find domain-specific phrases, tag with ontology classes.
Original ticket text:
We have raised a request #9646604 and found that the device xxx-xxx-xxx-130a Power LED is amber and it is in hung state. We checked the device for connectivity issues, cleaned the fiber and found that the power supply unit is faulty. We replaced the power supply unit.
Tagged ticket text:
We have raised a request #9646604 and found that the (device)/ReplaceableEntity xxx-xxx-xxx-130a (Power LED)/ReplaceableEntity is (amber)/Condition and it is in (hung state)/ProblemCondition. We (checked)/MaintenanceAction the (device)/ReplaceableEntity for (connectivity issues)/ProblemCondition, (cleaned)/MaintenanceAction the (fiber)/ReplaceableEntity and found that the (power supply unit)/ReplaceableEntity is (faulty)/ProblemCondition. We (replaced)/PhysicalAction the (power supply unit)/ReplaceableEntity.

Putting it All Together (2/2): Information Inference
Rules applied to the tagged ticket, with the resulting inference:
- Problems. Rule: Entity precedes/succeeds ProblemCondition. Inference: <device : hung state>, <power supply unit : faulty>
- Activities. Rule: Entity or Condition precedes/succeeds MaintenanceAction. Inference: <connectivity issues : checked>, <fiber : cleaned>
- Actions. Rule: Entity precedes/succeeds PhysicalAction. Inference: <power supply unit : replace>
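The precedes/succeeds rules can be approximated with a nearest-neighbor heuristic over the tags in one sentence. The sketch below is a simplification and an assumption on my part: it pairs each condition or action with the closest eligible tag, which handles simple sentences but would not reproduce every pairing shown on the slide (for example, it would attach "hung state" to "Power LED" rather than "device"). The names `RULES` and `infer` are hypothetical.

```python
# Output category and eligible partner classes for each trigger class.
RULES = {
    "ProblemCondition": ("Problems", {"ReplaceableEntity"}),
    "MaintenanceAction": ("Activities", {"ReplaceableEntity", "Condition", "ProblemCondition"}),
    "PhysicalAction": ("Actions", {"ReplaceableEntity"}),
}


def infer(tagged_sentence):
    """Pair each trigger with its nearest eligible partner in the sentence.

    tagged_sentence is a list of (phrase, ontology_class) in reading order.
    """
    result = {"Problems": [], "Activities": [], "Actions": []}
    for i, (phrase, cls) in enumerate(tagged_sentence):
        if cls not in RULES:
            continue
        category, partner_classes = RULES[cls]
        partners = [
            (abs(j - i), other)
            for j, (other, other_cls) in enumerate(tagged_sentence)
            if j != i and other_cls in partner_classes
        ]
        if partners:
            # Closest partner wins; ties go to the earlier one.
            _, nearest = min(partners, key=lambda item: item[0])
            result[category].append((nearest, phrase))
    return result


problem = infer([("power supply unit", "ReplaceableEntity"), ("faulty", "ProblemCondition")])
action = infer([("replaced", "PhysicalAction"), ("power supply unit", "ReplaceableEntity")])
activity = infer([("cleaned", "MaintenanceAction"), ("fiber", "ReplaceableEntity")])
```

Running it per sentence rather than per ticket keeps a condition from being attached to an entity mentioned in an unrelated sentence.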

NetSieve Evaluation

Evaluation Methodology
Goals: evaluate accuracy and usability
Metrics:
- Percentage accuracy, F-score, precision, recall
- Time to read a ticket manually vs. NetSieve inference
Dataset: 10K+ tickets
Ground truth: 696 tickets labeled by an expert; 155 tickets from two network vendors
Method:
1. Compare expert-labeled Problems and Actions with NetSieve inference
2. Survey of five operators, each shown 20 tickets at random
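For reference, precision, recall and F-score can be computed per ticket by comparing the inferred pairs against the expert labels as sets. This is a generic sketch of the standard definitions, not the evaluation script used in the study; the function name and the sample pairs are made up.

```python
def precision_recall_f1(predicted, expected):
    """Return (precision, recall, F1) for two collections of labels."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical inference vs. expert labels for one ticket.
inferred = {("power supply unit", "faulty"), ("device", "hung state"), ("fiber", "faulty")}
labeled = {("power supply unit", "faulty"), ("device", "hung state")}
precision, recall, f1 = precision_recall_f1(inferred, labeled)
```

Here two of the three inferred pairs are correct and both labeled pairs are found, so precision is 2/3, recall is 1.0 and the F-score is 0.8.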

Evaluating Accuracy: Expert-labeled and Vendor Tickets
96%-100% accuracy for Problems; 89%-100% accuracy for Actions
[Bar charts: percentage accuracy per device type, with ground truth from expert-labeled (696 tickets) and vendor (155 tickets) tickets]
Problems: Load Balancer-1 100, Load Balancer-2 100, Load Balancer-3 100, Load Balancer-4 100, Firewalls 97.7, Access Routers 97.1, Load Balancer 96.4, Core Routers 97.3
Actions: Load Balancer-1 96.6, Load Balancer-2 100, Load Balancer-3 100, Load Balancer-4 95.7, Firewalls 94.3, Access Routers 95.1, Load Balancer 98.7, Core Routers 89.6

NetSieve Use Cases for Network Management
1. Capacity Planning team. Question: Why is network redundancy ineffective?
   Findings: (1) faulty cables, (2) software version mismatch, (3) misconfigurations
2. Incident Management team. Question: What are the top-k failing components?
   Findings: (1) line card failures, (2) defective memory, (3) supervisor engine
3. Network Architecture team. Question: Are new devices more reliable?
   Findings: (1) a new access router is half as reliable as its predecessor, (2) software bugs dominated failures in one type of load balancer

Conclusion
Goal: automate problem inference from trouble tickets
NetSieve: a semantic-based approach
- Combines NLP, knowledge discovery and ontology modeling in a novel way
- Three key features: Problems, Activities and Actions
- Achieves an accuracy of 83%-100% over a large ticket dataset
Future work:
- Build an ontology model automatically
- Improve accuracy using expert feedback
- Apply NetSieve to other problem domains

Poster & Demo session tomorrow evening!
Project page: http://netsieve.info
Thank you for your time!