Causal Modeling of Observational Cost Data: A Ground-Breaking Use of Directed Acyclic Graphs

Bob Stoddard, SEMA
Mike Konrad, SEMA
November 17, 2015
Public Release; Distribution is

Copyright 2015 Carnegie Mellon University

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Department of Defense.

References herein to any specific commercial product, process, or service by trade name, trade mark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute.

NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN AS-IS BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at permission@sei.cmu.edu.
Carnegie Mellon is registered in the U.S. Patent and Trademark Office by Carnegie Mellon University.

DM-0003059

Agenda
- Problem of Developing CERs [1]
- Why Causation instead of Correlation
- Causal Modeling using DAGs [2]
- Examples
- Call for Action and Collaboration

[1] Cost Estimating Relationships
[2] Directed Acyclic Graphs

Problem of Developing CERs
Many CERs are built using traditional correlation and statistical regression modeling.
However, serious concerns exist in using these methods for the development of CERs, namely:
- What if other factors not represented in the model are responsible for the cost effects?
- What if there are convoluted factors impacting cost?
- What if cost analysts decide to interpret the regression coefficients as the degree of influence on cost?
- How do cost analysts confidently know that the CER parameters influence cost as compared to other factors that are correlated with these parameters?

Why Traditional Correlation Falls Short
Los Angeles Times, May 12, 2014
http://www.latimes.com/business/hiltzik/la-fi-mh-seecorrelation-is-not-causation-20140512-column.html

Why Causal Modeling is a Game Changer

Causal Modeling
Dr. Judea Pearl

Quotes by Judea Pearl

"I see no greater impediment to scientific progress than the prevailing practice of focusing all of our mathematical resources on probabilistic and statistical inferences while leaving causal considerations to the mercy of intuition and good judgment."
Pearl, J. (2009). Causality. Cambridge University Press. (Preface to 1st Edition)

"The development of Bayesian Networks, so people tell me, marked a turning point in the way uncertainty is handled in computer systems. For me, this development was a stepping stone towards a more profound transition, from reasoning about beliefs to reasoning about causal and counterfactual relationships."
Judea Pearl: From Bayesian Networks to Causal and Counterfactual Reasoning. Keynote Lecture at the 2014 BayesiaLab User Conference. Recorded on September 24, 2014, in Los Angeles.

Causal Modeling
Dr. Stephen Morgan

CMU Causal Modeling Researchers-01

CMU Causal Modeling Researchers-02

Causal Inference with Directed Graphs Training
2-Day Seminar offered by Dr. Felix Elwert, Univ of Wisconsin
Available through two channels:
- Statistical Horizons: www.statisticalhorizons.com
- BayesiaLab: http://www.bayesia.us/causal-inferencecourse-fairfax

Landscape of Causal Modeling
- Identity of true causal parameters of cost
- Raw Observational Data
- Statistical Discovery of Causal Relationships to create the DAG (CMU Faculty)
- Quantifying Causal Relations using DAG graph surgery and Instrumental Variables (Pearl & Elwert)

Use of Directed Acyclic Graphs
1. Derive testable implications of a causal model to evaluate if the model is correct
2. Understand causal identification requirements to confirm whether causality may be extracted from the data
   a) Separating causal from spurious associations in the data
3. Inform use of traditional statistical techniques such as regression
   a) Deciding which control variables to include versus not to include in the analysis to achieve identification of causality
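
The point about choosing control variables can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the variable names (complexity, team_size, cost, rework), the assumed DAG, and all coefficients are hypothetical and exist only to show how adjusting for a common cause helps while adjusting for a common effect hurts.

```python
import numpy as np

def ols_slope(y, regressors):
    """Least-squares coefficient on the first regressor (intercept included)."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical DAG (an assumption for illustration only):
#   complexity -> team_size, complexity -> cost, team_size -> cost,
#   team_size -> rework <- cost   (rework is a common effect of both)
complexity = rng.normal(size=n)
team_size = 0.8 * complexity + rng.normal(size=n)
cost = 2.0 * team_size + 1.5 * complexity + rng.normal(size=n)  # true effect = 2.0
rework = team_size + cost + rng.normal(size=n)

# No controls: the common cause (complexity) inflates the estimate.
naive = ols_slope(cost, [team_size])
# Control for the common cause: the estimate recovers the simulated effect.
adjusted = ols_slope(cost, [team_size, complexity])
# Also control for the common effect: the estimate is biased again.
overadjusted = ols_slope(cost, [team_size, complexity, rework])

print(f"no controls:            {naive:.2f}")
print(f"control for complexity: {adjusted:.2f}")
print(f"also control rework:    {overadjusted:.2f}")
```

In this simulated setting only the second regression is close to the true coefficient of 2.0, which is the kind of decision the DAG is meant to inform before any regression is run.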

Basic Concepts of DAGs
1. DAGs consist of:
   a) nodes (variables),
   b) directed arrows (possible causal relationships ordered by time), and
   c) missing arrows (confident assumptions about absence of causal effects)
2. DAGs are nonparametric
   a) No distributional assumptions
   b) Linear and/or nonlinear
3. DAGs have both causal paths and non-causal (spurious) paths

Three Structures Studied in a DAG
1. Indirect Connection
2. Common Cause
3. Common Effect (Collider)
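
The three structures can be made concrete with a small data structure. The sketch below is illustrative only: the node names and arrows form a hypothetical cost DAG, and the helper functions `is_acyclic` and `classify` are names invented here, not part of any tool mentioned in this presentation. It assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical cost DAG: each key points to the variables it may cause.
# Every arrow left out is an assumption that no direct effect exists.
children = {
    "size": ["effort"],
    "effort": ["cost"],
    "complexity": ["effort", "defects"],
    "defects": ["cost"],
}

def is_acyclic(graph):
    """True if the directed graph has no cycle (a topological order exists)."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False

def classify(graph, a, m, b):
    """Name the structure formed by a, m, b with m as the middle node."""
    def edge(x, y):
        return y in graph.get(x, ())
    if (edge(a, m) and edge(m, b)) or (edge(b, m) and edge(m, a)):
        return "indirect connection"        # a -> m -> b
    if edge(m, a) and edge(m, b):
        return "common cause"               # a <- m -> b
    if edge(a, m) and edge(b, m):
        return "common effect (collider)"   # a -> m <- b
    return None

print(is_acyclic(children))
print(classify(children, "size", "effort", "cost"))
print(classify(children, "effort", "complexity", "defects"))
print(classify(children, "effort", "cost", "defects"))
```

A plain dictionary of arrows is enough here because a DAG is nonparametric: the structure says which variables may affect which, without committing to any functional form.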

Deriving Testable Implications of a DAG
1. Uses a technique called d-separation
   a) Algorithm to help determine which paths are causal versus noncausal
   b) Uses concept of blocking a path to stop transmission of noncausal association
2. Additional techniques employed include
   a) Graphical identification
   b) Adjustment Criterion
   c) Backdoor Criterion
   d) Frontdoor Criterion
   e) Pearl's do-calculus

Blocking or Adjusting Paths
1. Controlling a variable
2. Stratifying a variable
3. Setting evidence on a variable
4. Observing a variable
5. Matching a variable (e.g., making distributions of sub-populations as similar as possible for comparison)
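
A d-separation check can be sketched in a few lines. The version below is an illustrative implementation using one standard formulation (restrict to ancestors, moralize, delete the conditioned variables, then test connectivity); it is not the algorithm of any specific tool named in these slides. The DAG, its node names, and the function names `parents_of` and `d_separated` are hypothetical. The `given` argument plays the role of the variables being controlled, stratified, or observed.

```python
from collections import deque

def parents_of(children):
    """Invert a {node: [children]} mapping into {node: {parents}}."""
    parents = {}
    for node, kids in children.items():
        parents.setdefault(node, set())
        for kid in kids:
            parents.setdefault(kid, set()).add(node)
    return parents

def d_separated(children, x, y, given=()):
    """True if every path between x and y is blocked once `given` is adjusted for.

    Assumes x and y are distinct nodes that are not themselves in `given`.
    """
    parents = parents_of(children)
    given = set(given)
    # 1. Keep only x, y, the adjusted variables, and all of their ancestors.
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralize: drop arrow directions and link parents that share a child.
    adj = {node: set() for node in keep}
    for node in keep:
        ps = list(parents.get(node, ()))
        for i, p in enumerate(ps):
            adj[node].add(p)
            adj[p].add(node)
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # 3. Remove the adjusted variables and search for any remaining connection.
    seen, queue = {x}, deque([x])
    while queue:
        node = queue.popleft()
        if node == y:
            return False
        for nxt in adj[node] - given - seen:
            seen.add(nxt)
            queue.append(nxt)
    return True

# Hypothetical cost DAG used only for illustration.
dag = {
    "size": ["effort"],
    "effort": ["cost"],
    "complexity": ["effort", "defects"],
    "defects": ["cost"],
}

print(d_separated(dag, "size", "defects"))                      # nothing adjusted
print(d_separated(dag, "size", "defects", {"cost"}))            # adjust for a collider
print(d_separated(dag, "size", "cost", {"effort"}))             # adjust for a mediator only
print(d_separated(dag, "size", "cost", {"effort", "complexity"}))
```

Each "True" answer is a testable implication: if the DAG is right, the two variables should be statistically independent in the data after adjusting for the given set. The second query shows why adjusting for a common effect is risky, since it opens a path that was blocked.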

Example: Causality Modeling with BayesiaLab
Excerpts taken from:

Cost Estimation Example
- Use the CMU tool, Tetrad, to discover causal parameters in a data set containing a wide variety of factors deemed relevant to cost, or
- Hypothesize a set of factors related to cost, along with their hypothesized interrelationships, followed by causal modeling using Pearl graph surgery or instrumental variable analysis using Stata
- Factors may relate to existing cost parameters as well as factors related to new or emergent cost influences, such as Agile and DevOps
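
The instrumental variable idea can be sketched on simulated data. The slides name Stata for this analysis; the Python below is a stand-in for illustration, not the authors' workflow. Everything in it is an assumption: the variables (capability, mandate, adoption, cost), the causal structure, and the coefficients are hypothetical, and the estimator shown is the simple single-instrument ratio form.

```python
import numpy as np

def slope(y, x):
    """Simple-regression slope of y on x."""
    centered = x - x.mean()
    return float(centered @ (y - y.mean()) / (centered @ centered))

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical setting (assumed for illustration):
#   capability is an UNOBSERVED common cause of Agile adoption and cost;
#   mandate shifts adoption but affects cost only through adoption,
#   which is what qualifies it as an instrument.
capability = rng.normal(size=n)
mandate = rng.normal(size=n)
adoption = 0.7 * mandate + 0.9 * capability + rng.normal(size=n)
cost = -1.0 * adoption - 2.0 * capability + rng.normal(size=n)  # true effect = -1.0

# Naive regression mixes the causal effect with the spurious path via capability.
naive = slope(cost, adoption)

# Instrumental variable estimate: effect of the instrument on cost,
# divided by the effect of the instrument on adoption.
first_stage = slope(adoption, mandate)
reduced_form = slope(cost, mandate)
iv_estimate = reduced_form / first_stage

print(f"naive estimate: {naive:.2f}")
print(f"IV estimate:    {iv_estimate:.2f}")
```

In this simulation the naive slope overstates the cost reduction because capable teams both adopt more and cost less, while the instrument-based estimate stays close to the simulated effect. Whether a real variable qualifies as an instrument is an assumption that the DAG makes explicit.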

Call for Action and Collaboration
- Causal modeling with observational data is practical
- Causal modeling informs which variables to include in experimental research
- You should consider building causal methodology into your CER development
- Practical methods and tooling now exist to discover (Tetrad) and model (Tetrad, Stata) causal relationships in data
- We (SEI) seek to partner with you in developing CERs by applying causal methods to your data

Contact Information

Points of Contact
SEMA Cost Estimation Research Group
Robert Stoddard: rws@sei.cmu.edu
Mike Konrad: mdk@sei.cmu.edu

U.S. Mail
Software Engineering Institute
Customer Relations
4500 Fifth Avenue
Pittsburgh, PA 15213-2612, USA

Web
www.sei.cmu.edu
www.sei.cmu.edu/contact.cfm

Customer Relations
Email: info@sei.cmu.edu
Telephone: +1 412-268-5800
SEI Phone: +1 412-268-5800
SEI Fax: +1 412-268-6257