Causal Modeling of Observational Cost Data: A Ground-Breaking use of Directed Acyclic Graphs

Transcription:

Causal Modeling of Observational Cost Data: A Ground-Breaking use of Directed Acyclic Graphs

Bob Stoddard, SEMA
Mike Konrad, SEMA
November 17, 2015

Copyright 2015 Carnegie Mellon University

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Department of Defense.

References herein to any specific commercial product, process, or service by trade name, trade mark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute.

NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN AS-IS BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at permission@sei.cmu.edu.

Carnegie Mellon is registered in the U.S. Patent and Trademark Office by Carnegie Mellon University.

DM-0003059

Agenda
Problem of Developing CERs [1]
Why Causation instead of Correlation
Causal Modeling using DAGs [2]
Examples
Call for Action and Collaboration
[1] Cost Estimating Relationships
[2] Directed Acyclic Graphs

Problem of Developing CERs
Many CERs are built using traditional correlation and statistical regression modeling.
However, serious concerns exist in using these methods for the development of CERs, namely:
  What if other factors not represented in the model are responsible for the cost effects?
  What if there are convoluted factors impacting cost?
  What if cost analysts decide to interpret the regression coefficients as the degree of influence on cost?
  How do cost analysts confidently know that the CER parameters influence cost as compared to other factors that are correlated with these parameters?
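
The first concern above (a factor missing from the model that actually drives cost) can be made concrete with a small simulation. This is a minimal sketch and not the authors' analysis: the variables `complexity`, `team_size`, and `cost`, the coefficients, and the sample size are all invented for illustration. It assumes `complexity` raises both `team_size` and `cost`, and that the true effect of `team_size` on `cost` is 2.0.

```python
import random


def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def residuals(y, x):
    """What is left of y after removing its linear dependence on x."""
    b = ols_slope(x, y)
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return [(yi - mean_y) - b * (xi - mean_x) for xi, yi in zip(x, y)]


def simulate(n=20000, seed=0):
    """Hypothetical data: complexity drives both team size and cost."""
    rng = random.Random(seed)
    complexity = [rng.gauss(0, 1) for _ in range(n)]
    team_size = [c + rng.gauss(0, 1) for c in complexity]
    # Assumed true causal effect of team_size on cost: 2.0.
    cost = [2.0 * t + 3.0 * c + rng.gauss(0, 1)
            for t, c in zip(team_size, complexity)]
    return complexity, team_size, cost


complexity, team_size, cost = simulate()

# Regression that omits complexity: the coefficient absorbs its influence.
naive = ols_slope(team_size, cost)

# Regression that accounts for complexity (residual-on-residual form,
# equivalent to including complexity as a second regressor).
adjusted = ols_slope(residuals(team_size, complexity),
                     residuals(cost, complexity))

print(f"naive coefficient:    {naive:.2f}")
print(f"adjusted coefficient: {adjusted:.2f}")
```

Under these assumptions the naive coefficient settles well above the true effect (about 3.5 in expectation), while the adjusted coefficient recovers roughly 2.0. Reading the naive coefficient as "the degree of influence on cost" would therefore overstate the effect of team size.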

Agenda
Problem of Developing CERs
Why Causation instead of Correlation
Causal Modeling using DAGs
Examples
Call for Action and Collaboration

Why Traditional Correlation Falls Short
Los Angeles Times, May 12, 2014
http://www.latimes.com/business/hiltzik/la-fi-mh-seecorrelation-is-not-causation-20140512-column.html

Why Causal Modeling is a Game Changer

Causal Modeling
Dr. Judea Pearl

Quotes by Judea Pearl

"I see no greater impediment to scientific progress than the prevailing practice of focusing all of our mathematical resources on probabilistic and statistical inferences while leaving causal considerations to the mercy of intuition and good judgment."
Pearl, J. (2009). Causality. Cambridge University Press. (Preface to 1st Edition)

"The development of Bayesian Networks, so people tell me, marked a turning point in the way uncertainty is handled in computer systems. For me, this development was a stepping stone towards a more profound transition, from reasoning about beliefs to reasoning about causal and counterfactual relationships."
Judea Pearl: From Bayesian Networks to Causal and Counterfactual Reasoning. Keynote Lecture at the 2014 BayesiaLab User Conference. Recorded on September 24, 2014, in Los Angeles.

Causal Modeling
Dr. Stephen Morgan

CMU Causal Modeling Researchers-01

CMU Causal Modeling Researchers-02

Causal Inference with Directed Graphs Training
2-Day Seminar offered by Dr. Felix Elwert, Univ of Wisconsin
Available through two channels:
  Statistical Horizons: www.statisticalhorizons.com
  BayesiaLab: http://www.bayesia.us/causal-inferencecourse-fairfax

Agenda
Problem of Developing CERs
Why Causation instead of Correlation
Causal Modeling using DAGs
Examples
Call for Action and Collaboration

Landscape of Causal Modeling
Identity of true causal parameters of cost
Raw Observational Data
Statistical Discovery of Causal Relationships to create the DAG (CMU Faculty)
Quantifying Causal Relations using DAG graph surgery and Instrumental Variables (Pearl & Elwert)

Use of Directed, Acyclic Graphs
1. Derive testable implications of a causal model to evaluate if the model is correct
2. Understand causal identification requirements to confirm whether causality may be extracted from the data
  Separating causal from spurious associations in the data
3. Inform use of traditional statistical techniques such as regression
  Deciding which control variables to include versus not to include in the analysis to achieve identification of causality
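
"Graph surgery", named above as one way to quantify a causal relation, means removing every arrow that points into the variable being intervened on and then seeing how the outcome responds. The sketch below is only an illustration under assumed structural equations. The three-variable DAG (`size`, `effort`, `cost`), its coefficients, and the helper names are hypothetical and do not come from the presentation.

```python
import random

# Hypothetical DAG, written as child -> list of parents.
PARENTS = {
    "size": [],
    "effort": ["size"],
    "cost": ["size", "effort"],
}


def graph_surgery(parents, target):
    """Return a copy of the DAG with every arrow into `target` removed."""
    cut = {node: list(ps) for node, ps in parents.items()}
    cut[target] = []
    return cut


def sample(rng, do_effort=None):
    """Draw one record; if do_effort is set, effort no longer listens to size."""
    size = rng.gauss(0, 1)
    noise_effort = rng.gauss(0, 1)
    if do_effort is None:
        effort = 1.5 * size + noise_effort   # assumed structural equation
    else:
        effort = do_effort                   # intervention replaces the equation
    cost = 2.0 * effort + 1.0 * size + rng.gauss(0, 1)
    return size, effort, cost


def mean_cost(do_effort, n=20000, seed=1):
    """Average cost when effort is held at do_effort for every record."""
    rng = random.Random(seed)
    return sum(sample(rng, do_effort)[2] for _ in range(n)) / n


surgered = graph_surgery(PARENTS, "effort")
effect = mean_cost(do_effort=1.0) - mean_cost(do_effort=0.0)

print("parents of effort after surgery:", surgered["effort"])
print(f"average causal effect of effort on cost: {effect:.2f}")
```

Because both intervention arms reuse the same seed, the difference isolates the assumed coefficient on `effort` (2.0). With real observational data the structural equations are unknown, so the same quantity has to be estimated by adjusting for the right variables, which is where the DAG guides the choice of controls.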

Basic Concepts of DAGs
1. DAGs consist of:
  a) nodes (variables),
  b) directed arrows (possible causal relationships ordered by time), and
  c) missing arrows (confident assumptions about absence of causal effects)
2. DAGs are nonparametric
  a) No distributional assumptions
  b) Linear and/or nonlinear
3. DAGs have both causal paths and non-causal (spurious) paths

Three Structures Studied in a DAG
1. Indirect Connection
2. Common Cause
3. Common Effect (Collider)
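
The three structures listed above can be recognized mechanically from the arrows alone. The following sketch stores a DAG as a set of (parent, child) pairs, checks that it is acyclic, and names the structure at the middle node of a three-node path. The cost-related node names are hypothetical and chosen only to make the example readable.

```python
def is_acyclic(edges):
    """Kahn's algorithm: a directed graph is acyclic iff every node can be
    removed in an order where it has no remaining incoming arrows."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return removed == len(nodes)


def classify(edges, a, b, c):
    """Name the structure formed at the middle node b of the path a - b - c."""
    if (a, b) in edges and (c, b) in edges:
        return "common effect (collider)"
    if (b, a) in edges and (b, c) in edges:
        return "common cause"
    if ((a, b) in edges and (b, c) in edges) or \
            ((c, b) in edges and (b, a) in edges):
        return "indirect connection"
    raise ValueError("a - b - c is not a path in this graph")


# Hypothetical cost DAG: each pair is (cause, effect).
EDGES = {
    ("requirements_volatility", "rework"),
    ("rework", "cost"),
    ("complexity", "team_size"),
    ("complexity", "cost"),
    ("team_size", "cost"),
}

print(is_acyclic(EDGES))
print(classify(EDGES, "requirements_volatility", "rework", "cost"))
print(classify(EDGES, "team_size", "complexity", "cost"))
print(classify(EDGES, "rework", "cost", "team_size"))
```

Note that the arrows absent from `EDGES` carry information too: leaving out an arrow from `requirements_volatility` directly to `cost` asserts that volatility affects cost only through rework.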

Deriving Testable Implications of a DAG
1. Uses a technique called d-separation
  a) Algorithm to help determine which paths are causal versus noncausal
  b) Uses concept of blocking a path to stop transmission of noncausal association
2. Additional techniques employed include
  a) Graphical identification
  b) Adjustment Criterion
  c) Backdoor Criterion
  d) Frontdoor Criterion
  e) Pearl's do-calculus

Blocking or Adjusting Paths
1. Controlling a variable
2. Stratifying a variable
3. Setting evidence on a variable
4. Observing a variable
5. Matching a variable (e.g., making distributions of sub-populations as similar as possible for comparison)
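
A d-separation check like the one described above can be written in a few lines. This sketch uses one standard formulation, the moralized ancestral graph test, which is not necessarily the procedure used by the tools named in this presentation. The DAG and its node names are hypothetical. The `given` argument plays the role of the blocked (controlled, stratified, or observed) variables.

```python
def ancestors(parents, nodes):
    """The given nodes together with everything upstream of them."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `parents`
    (child -> list of parents). Assumes x and y are not in `given`."""
    given = set(given)
    keep = ancestors(parents, {x, y} | given)
    # Moralize the ancestral subgraph: link each child to its parents and
    # link parents that share a child, then forget arrow directions.
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # x and y are d-separated iff no path joins them once `given` is removed.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        for m in nbrs[node]:
            if m in given or m in seen:
                continue
            if m == y:
                return False
            seen.add(m)
            stack.append(m)
    return True


# Hypothetical cost DAG, child -> parents.
PARENTS = {
    "rework": ["requirements_volatility"],
    "team_size": ["complexity"],
    "cost": ["rework", "complexity", "team_size"],
}

# Indirect connection: open until the middle variable is blocked.
print(d_separated(PARENTS, "requirements_volatility", "cost"))
print(d_separated(PARENTS, "requirements_volatility", "cost", {"rework"}))
# Collider: closed until the common effect is conditioned on.
print(d_separated(PARENTS, "requirements_volatility", "complexity"))
print(d_separated(PARENTS, "requirements_volatility", "complexity", {"cost"}))
```

Each `True` result is a testable implication: if the DAG is right, the two variables should be statistically independent in the data given the listed set. The last call shows why blocking is not always harmless, since conditioning on a common effect opens a path that was previously closed.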

Agenda
Problem of Developing CERs
Why Causation instead of Correlation
Causal Modeling using DAGs
Examples
Call for Action and Collaboration

Example: Causality Modeling with BayesiaLab
Excerpts taken from:

Cost Estimation Example
Use the CMU tool, Tetrad, to discover causal parameters in a data set containing a wide variety of factors deemed relevant to cost, or
Hypothesize a set of factors related to cost, along with their hypothesized interrelationships, followed by causal modeling using Pearl graph surgery or instrumental variable analysis using Stata
Factors may relate to existing cost parameters as well as factors related to new or emergent cost influences, such as Agile and DevOps
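
The instrumental variable analysis mentioned above is attributed to Stata in the slides. The sketch below is not Stata and not the authors' workflow. It is a plain-Python illustration of the simplest instrumental variable estimator, the ratio of covariances, on simulated data. All variable names, coefficients, and the assumption that `instrument` affects `cost` only through `driver` are hypothetical.

```python
import random


def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n


def simulate(n=20000, seed=2):
    """Hypothetical data with an unmeasured confounder u."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)                       # unmeasured confounder
        z = rng.gauss(0, 1)                       # instrument, independent of u
        x = z + u + rng.gauss(0, 1)               # cost driver
        y = 2.0 * x + 3.0 * u + rng.gauss(0, 1)   # cost; assumed true effect 2.0
        rows.append((z, x, y))
    z, x, y = (list(col) for col in zip(*rows))
    return z, x, y


instrument, driver, cost = simulate()

# Ordinary regression slope: biased because u moves both driver and cost.
naive = covariance(driver, cost) / covariance(driver, driver)

# Instrumental variable (ratio) estimator: uses only the variation in the
# driver that comes from the instrument, which u cannot contaminate.
iv_estimate = covariance(instrument, cost) / covariance(instrument, driver)

print(f"naive slope: {naive:.2f}")
print(f"IV estimate: {iv_estimate:.2f}")
```

In this setup the naive slope drifts toward 3.0 while the instrumental variable estimate stays close to the assumed effect of 2.0, even though the confounder is never observed. The estimate is only as good as the assumption that the instrument has no other route to cost, which is exactly the kind of claim a DAG makes explicit.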

Agenda
Problem of Developing CERs
Why Causation instead of Correlation
Causal Modeling using DAGs
Examples
Call for Action and Collaboration

Call for Action and Collaboration
Causal modeling with observational data is practical
Causal modeling informs which variables to include in experimental research
You should consider building causal methodology into your CER development
Practical methods and tooling now exist to discover (Tetrad) and model (Tetrad, Stata) causal relationships in data
We (SEI) seek to partner with you in developing CERs by applying causal methods to your data

Contact Information

Points of Contact
SEMA Cost Estimation Research Group
Robert Stoddard, rws@sei.cmu.edu
Mike Konrad, mdk@sei.cmu.edu

U.S. Mail
Software Engineering Institute
Customer Relations
4500 Fifth Avenue
Pittsburgh, PA 15213-2612, USA

Web
www.sei.cmu.edu
www.sei.cmu.edu/contact.cfm

Customer Relations
Email: info@sei.cmu.edu
Telephone: +1 412-268-5800
SEI Phone: +1 412-268-5800
SEI Fax: +1 412-268-6257