Static Analysis of Dynamically Typed Languages made Easy

Similar documents
CS558 Programming Languages

Compilers and computer architecture: Semantic analysis

Interprocedural Analysis with Data-Dependent Calls. Circularity dilemma. A solution: optimistic iterative analysis. Example

Type Soundness. Type soundness is a theorem of the form If e : τ, then running e never produces an error

Interprocedural Analysis with Data-Dependent Calls. Circularity dilemma. A solution: optimistic iterative analysis. Example

Logic-Flow Analysis of Higher-Order Programs

Announcements. Working on requirements this week Work on design, implementation. Types. Lecture 17 CS 169. Outline. Java Types

CSCI 3155: Principles of Programming Languages Exam preparation #1 2007

Lecture #23: Conversion and Type Inference

Conversion vs. Subtyping. Lecture #23: Conversion and Type Inference. Integer Conversions. Conversions: Implicit vs. Explicit. Object x = "Hello";

Writing Evaluators MIF08. Laure Gonnord

Syntax and Grammars 1 / 21

CMSC 330: Organization of Programming Languages. OCaml Expressions and Functions

CSE 413 Languages & Implementation. Hal Perkins Winter 2019 Structs, Implementing Languages (credits: Dan Grossman, CSE 341)

Summer 2017 Discussion 10: July 25, Introduction. 2 Primitives and Define

Semantic Analysis. Outline. The role of semantic analysis in a compiler. Scope. Types. Where we are. The Compiler Front-End

Operational Semantics. One-Slide Summary. Lecture Outline

CSE341: Programming Languages Lecture 17 Implementing Languages Including Closures. Dan Grossman Autumn 2018

Lecture Outline. COOL operational semantics. Operational Semantics of Cool. Motivation. Notation. The rules. Evaluation Rules So Far.

Compiler Optimisation

INF 102 CONCEPTS OF PROG. LANGS FUNCTIONAL COMPOSITION. Instructors: James Jones Copyright Instructors.

INF 212 ANALYSIS OF PROG. LANGS FUNCTION COMPOSITION. Instructors: Crista Lopes Copyright Instructors.

Encoding functions in FOL. Where we re at. Ideally, would like an IF stmt. Another example. Solution 1: write your own IF. No built-in IF in Simplify

Hard deadline: 3/28/15 1:00pm. Using software development tools like source control. Understanding the environment model and type inference.

Lecture Compiler Middle-End

Intro. Scheme Basics. scm> 5 5. scm>

CS152: Programming Languages. Lecture 11 STLC Extensions and Related Topics. Dan Grossman Spring 2011

CS558 Programming Languages

Parsing Scheme (+ (* 2 3) 1) * 1

CSCI-GA Scripting Languages

(Refer Slide Time: 4:00)

CS558 Programming Languages

Static Program Analysis Part 9 pointer analysis. Anders Møller & Michael I. Schwartzbach Computer Science, Aarhus University

CS 360 Programming Languages Interpreters

Project 5 Due 11:59:59pm Wed, July 16, 2014

Introducing Wybe a language for everyone

Comp 411 Principles of Programming Languages Lecture 3 Parsing. Corky Cartwright January 11, 2019

Soot A Java Bytecode Optimization Framework. Sable Research Group School of Computer Science McGill University

The role of semantic analysis in a compiler

django-dynamic-db-router Documentation

Collage of Static Analysis

CS558 Programming Languages

Objects CHAPTER 6. FIGURE 1. Concrete syntax for the P 3 subset of Python. (In addition to that of P 2.)

Dynamically-typed Languages. David Miller

Programs as data Parsing cont d; first-order functional language, type checking

Types. Type checking. Why Do We Need Type Systems? Types and Operations. What is a type? Consensus

Guido van Rossum 9th LASER summer school, Sept. 2012

7. Introduction to Denotational Semantics. Oscar Nierstrasz

Beyond Blocks: Python Session #1

Project 6 Due 11:59:59pm Thu, Dec 10, 2015

Project 5 Due 11:59:59pm Wed, Nov 25, 2015 (no late submissions)

Object Model Comparisons

The design of a programming language for provably correct programs: success and failure

Programs as data first-order functional language type checking

Key components of a lang. Deconstructing OCaml. In OCaml. Units of computation. In Java/Python. In OCaml. Units of computation.

CS1622. Semantic Analysis. The Compiler So Far. Lecture 15 Semantic Analysis. How to build symbol tables How to use them to find

Types and Type Inference

A First Look at ML. Chapter Five Modern Programming Languages, 2nd ed. 1

Python Implementation Strategies. Jeremy Hylton Python / Google

Lecture Outline. COOL operational semantics. Operational Semantics of Cool. Motivation. Lecture 13. Notation. The rules. Evaluation Rules So Far

CSE 431S Final Review. Washington University Spring 2013

COMP 181. Agenda. Midterm topics. Today: type checking. Purpose of types. Type errors. Type checking

Type Checking and Type Inference

Project 5 Due 11:59:59pm Tuesday, April 25, 2017

Exceptions & a Taste of Declarative Programming in SQL

Semantic Processing (Part 2)

Overloading, Type Classes, and Algebraic Datatypes

Types and Type Inference

Compilers. Type checking. Yannis Smaragdakis, U. Athens (original slides by Sam

Introduction to Functional Programming in Haskell 1 / 56

Discussion 12 The MCE (solutions)

CS153: Compilers Lecture 14: Type Checking

Values (a.k.a. data) representation. Advanced Compiler Construction Michel Schinz

Values (a.k.a. data) representation. The problem. Values representation. The problem. Advanced Compiler Construction Michel Schinz

CSCE 314 Programming Languages. Type System

Semantic Analysis. Outline. The role of semantic analysis in a compiler. Scope. Types. Where we are. The Compiler so far

SCHEME 7. 1 Introduction. 2 Primitives COMPUTER SCIENCE 61A. October 29, 2015

CSE413: Programming Languages and Implementation Racket structs Implementing languages with interpreters Implementing closures

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 11

Verified compilers. Guest lecture for Compiler Construction, Spring Magnus Myréen. Chalmers University of Technology

A Practical Optional Type System for Clojure. Ambrose Bonnaire-Sergeant

Operational Semantics of Cool

CA Compiler Construction

CS558 Programming Languages

Typical workflow. CSE341: Programming Languages. Lecture 17 Implementing Languages Including Closures. Reality more complicated

Intrinsically Typed Reflection of a Gallina Subset Supporting Dependent Types for Non-structural Recursion of Coq

OOPLs - call graph construction. Example executed calls

CSC312 Principles of Programming Languages : Functional Programming Language. Copyright 2006 The McGraw-Hill Companies, Inc.

Type Inference. Prof. Clarkson Fall Today s music: Cool, Calm, and Collected by The Rolling Stones

Semantic Processing (Part 2)

CMSC 330: Organization of Programming Languages

Lecture #12: Quick: Exceptions and SQL

CSE 341: Programming Languages

Project 5 Due 11:59:59pm Tue, Dec 11, 2012

Outline. Introduction Concepts and terminology The case for static typing. Implementing a static type system Basic typing relations Adding context

CS61A Lecture 38. Robert Huang UC Berkeley April 17, 2013

Project 4 Due 11:59:59pm Thu, May 10, 2012

CS-XXX: Graduate Programming Languages. Lecture 9 Simply Typed Lambda Calculus. Dan Grossman 2012

Test I Solutions MASSACHUSETTS INSTITUTE OF TECHNOLOGY Spring Department of Electrical Engineering and Computer Science

Simon Peyton Jones Microsoft Research August 2013

Transcription:

Static Analysis of Dynamically Typed Languages made Easy Yin Wang School of Informatics and Computing Indiana University

Overview Work done as two internships at Google (2009 summer and 2010 summer) Motivation: The Grok Project: static analysis of all code at Google (C++, Java, JavaScript, Python, Sawzall, Protobuf ) Initial goal was not ambitious: Implement IDE- like code- browsing Turns out to be hard for Python

Achieved Goals Build high- accuracy semantic indexes Detect and report semantic bugs type errors missing return statement unreachable code

Demo Time Go

Problems Faced by Static Analysis of Dynamically Typed Languages

1. Problems with Dynamic Typing Dynamic typing makes it hard to resolve some names Mostly happen in polymorphic functions def h(x): return x.z

1. Problems with Dynamic Typing Dynamic typing makes it hard to resolve some names Mostly happen in polymorphic functions def h(x): return x.z Q: Where is z defined? A: wherever we defined x

1. Problems with Dynamic Typing Dynamic typing makes it hard to resolve some names Mostly happen in polymorphic functions def h(x): return x.z Q: What is x? A: Uhh Q: Where is z defined? A: wherever we defined x

1. Problems with Dynamic Typing Dynamic typing makes it hard to resolve some names Mostly happen in polymorphic functions def h(x): return x.z Q: What is x? A: Uhh Solution: use a static type system use inter- procedural analysis to infer types Q: Where is z defined? A: wherever we defined x

Static Type System for Python Mostly a usual type system, with two extras: union and dict primitive types int, str, float, bool tuple types (int,float), (A, B, C) list types [int], [bool], [(int,bool)] dict types {int => str}, {A => B} class types ClassA, ClassB union types {int str}, {A B C} recursive types #1(int, 1), #2(int - > 2) function types int - > bool, A - > B

2. Problems with Control- Flow Graph CFGs are tricky to build for high- order programs Attempts to build CFGs have led to complications and limitations in control- flow analysis Shivers 1988, 1991 build CFG after CPS Might & Shivers 2006,2007 solve problems introduced by CFG Vardoulakis & Shivers 2010,2011 solve problems introduced by CPS def g(f,x): return f(x) def h1(x): return x+1 def h2(x): return x+2

2. Problems with Control- Flow Graph Where is the CFG target? CFGs are tricky to build for high- order programs Attempts to build CFGs have led to complications and limitations in control- flow analysis Shivers 1988, 1991 build CFG after CPS Might & Shivers 2006,2007 solve problems introduced by CFG Vardoulakis & Shivers 2010,2011 solve problems introduced by CPS def g(f,x): return f(x) def h1(x): return x+1 def h2(x): return x+2

2. Problems with Control- Flow Graph Where is the CFG target? CFGs are tricky to build for high- order programs Attempts to Solution: build CFGs have led to complications and Don t limitations CPS the input in program control- flow analysis Don t try constructing the CFG Use direct- style, recursive abstract interpreter Shivers 1988, 1991 build CFG after CPS Might & Shivers 2006,2007 solve problems introduced by CFG Vardoulakis & Shivers 2010,2011 solve problems introduced by CPS def g(f,x): return f(x) def h1(x): return x+1 def h2(x): return x+2

3. Problems with Dynamic Field Creation/ Deletion class A: x = 1 obj = A() obj.y = 3 print obj.x, obj.y

3. Problems with Dynamic Field Creation/ Deletion class A: x = 1 obj = A() obj.y = 3 print obj.x, obj.y Solution:

3. Problems with Dynamic Field Creation/ Deletion class A: x = 1 obj = A() obj.y = 3 print obj.x, obj.y Solution:

3. Problems with Dynamic Field Creation/ Deletion class A: x = 1 obj = A() obj.y = 3 print obj.x, obj.y Solution: create abstract objects at constructor calls

3. Problems with Dynamic Field Creation/ Deletion class A: x = 1 obj = A() obj.y = 3 print obj.x, obj.y Solution: create abstract objects at constructor calls

3. Problems with Dynamic Field Creation/ Deletion class A: x = 1 obj = A() obj.y = 3 print obj.x, obj.y Solution: create abstract objects at constructor calls Actually change the abstract objects when fields are created

3. Problems with Dynamic Field Creation/ Deletion class A: x = 1 obj = A() obj.y = 3 print obj.x, obj.y Solution: create abstract objects at constructor calls Actually change the abstract objects when fields are created

3. Problems with Dynamic Field Creation/ Deletion class A: x = 1 obj = A() obj.y = 3 print obj.x, obj.y Solution: create abstract objects at constructor calls Actually change the abstract objects when fields are created Classes are not affect by the change

4. Problems with More Powerful Dynamic Features direct operations on dict (e.g. setattr, delattr, ) dynamic object reparenting import hacks eval

4. Problems with More Powerful Dynamic Features direct operations on dict (e.g. setattr, delattr, ) dynamic object reparenting import hacks eval Solution: Python Style Guide

Overall Structure of Analysis AST node (Expr) Type Env (Env) Stack (Env) Abstract Interpreter (AI) - (Expr,Env,Stk) - > Type Type

Overall Structure of Analysis AST node (Expr) Type Env (Env) Stack (Env) direct- style recursive Abstract Interpreter (AI) - (Expr,Env,Stk) - > Type Type

Overall Structure of Analysis AST node (Expr) Type Env (Env) Stack (Env) direct- style recursive Abstract Interpreter (AI) - (Expr,Env,Stk) - > Type during execution Global Information Table (GIT) { x@1 :: int, f@12 :: int - > bool, z@23 :: unbound variable } Type

Overall Structure of Analysis direct- style (not CPSed) no CFG generated keep it clean (don t record anything on it) AST node (Expr) Type Env (Env) Stack (Env) direct- style recursive Abstract Interpreter (AI) - (Expr,Env,Stk) - > Type during execution Global Information Table (GIT) { x@1 :: int, f@12 :: int - > bool, z@23 :: unbound variable } Type

Overall Structure of Analysis direct- style (not CPSed) no CFG generated keep it clean (don t record anything on it) AST node (Expr) Type Env (Env) Stack (Env) direct- style recursive loop detection (using stack) Abstract Interpreter (AI) - (Expr,Env,Stk) - > Type during execution Global Information Table (GIT) { x@1 :: int, f@12 :: int - > bool, z@23 :: unbound variable } Type

Actual Code of Main Interpreter

Actual Code of Main Interpreter input expression type environment stack of nodes on path (recursion detection)

Actual Code of Main Interpreter input expression type environment stack of nodes on path (recursion detection) lookup variable s type reocord type to GIT return a type

Actual Code of Main Interpreter input expression type environment stack of nodes on path (recursion detection) lookup variable s type reocord type to GIT return a type record error to GIT

Actual Code of Main Interpreter input expression type environment stack of nodes on path (recursion detection) lookup variable s type reocord type to GIT make closures for functions return a type invoke function (closure) record error to GIT

Multiple- Worlds Model HardQuestion()? x = 1 y = hello x = false y = 2.3 z = x + y

Multiple- Worlds Model HardQuestion()? split world x = 1 y = hello x = false y = 2.3 z = x + y

Multiple- Worlds Model HardQuestion()? split world x = 1 y = hello x = false y = 2.3 x : int y : str x : bool y : float z = x + y

Multiple- Worlds Model HardQuestion()? split world x = 1 y = hello x = false y = 2.3 x : int y : str x : bool y : float merge world z = x + y

Multiple- Worlds Model HardQuestion()? split world x = 1 y = hello x = false y = 2.3 x : int y : str x : bool y : float merge world z = x + y x : { int bool } y : { str float }

Multiple- Worlds Model HardQuestion()? split world x = 1 y = hello x = false y = 2.3 Clone the state for backtracing x : int y : str x : bool y : float merge world z = x + y x : { int bool } y : { str float }

Recursion Detection (1) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5)

Recursion Detection (1) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) Assumption: the same call site with the same argument types always produces the same output type (or nontermination)

Recursion Detection (1) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@7, int> return n * fact(n- 1) Assumption: the same call site with the same argument types always produces the same output type (or nontermination)

Recursion Detection (1) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@7, int> return n * fact(n- 1) Assumption: the same call site with the same argument types always produces the same output type (or nontermination) fact@7 may return int

Recursion Detection (1) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@7, int> return n * fact(n- 1) Assumption: the same call site with the same argument types always produces the same output type (or nontermination) fact@7 may return int < fact@5, int>

Recursion Detection (1) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@7, int> return n * fact(n- 1) not on stack not a loop Assumption: the same call site with the same argument types always produces the same output type (or nontermination) fact@7 may return int < fact@5, int>

Recursion Detection (2) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5)

Recursion Detection (2) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@5, int> < fact@7, int> return n * fact(n- 1)

Recursion Detection (2) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@5, int> < fact@7, int> return n * fact(n- 1) fact@5 may return int

Recursion Detection (2) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@5, int> < fact@7, int> return n * fact(n- 1) fact@5 may return int < fact@5, int>

Recursion Detection (2) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@5, int> < fact@7, int> return n * fact(n- 1) on stack loop detected! fact@5 may return int < fact@5, int>

Recursion Detection (2) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@5, int> < fact@7, int> return unknown type return n * fact(n- 1) on stack loop detected! fact@5 may return int < fact@5, int>

Recursion Detection (2) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@5, int> < fact@7, int> return unknown type return n * fact(n- 1) on stack loop detected! fact@5 may return int return unknown type and unify with int (possible false- negative here) < fact@5, int>

Recursion Detection (2) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@5, int> < fact@7, int> return unknown type return n * fact(n- 1) on stack loop detected! fact@5 may return int return unknown type and unify with int (possible false- negative here) < fact@5, int> Call fact@5 returns int finally Record fact@5 :: int

Recursion Detection (2) 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) return 1 n = 0? < fact@5, int> < fact@7, int> return unknown type return n * fact(n- 1) on stack loop detected! Call fact@7 returns int finally Record fact@1:: int - > int fact@5 may return int Call fact@5 returns int finally Record fact@5 :: int return unknown type and unify with int (possible false- negative here) < fact@5, int>

Correctness of Recursion Detection Every program is a dynamic circuit 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) Every call site is a conjuction point in the dynamic circuit, because it connects to an instance of a function body The same call site with the same arguments is a unique joint point in the process graph, with a deterministic future If the same <call site, argument type> combination has appear before in the path, there must be a loop

Correctness of Recursion Detection Every program is a dynamic circuit 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: 5: return n * fact(n- 1) 6: 7: fact(5) Every call site is a conjuction point in the dynamic circuit, because it connects to an instance of a function body The same call site with the same arguments is a unique joint point in the process graph, with a deterministic future If the same <call site, argument type> combination has appear before in the path, there must be a loop

Correctness of Recursion Detection Every program is a dynamic circuit 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: <fact@5, int> 5: return n * fact(n- 1) 6: 7: fact(5) Every call site is a conjuction point in the dynamic circuit, because it connects to an instance of a function body The same call site with the same arguments is a unique joint point in the process graph, with a deterministic future If the same <call site, argument type> combination has appear before in the path, there must be a loop

Correctness of Recursion Detection Every program is a dynamic circuit 1: def fact(n): 2: if (n == 0): 3: return 1 4: else: <fact@5, int> 5: return n * fact(n- 1) 6: 7: fact(5) Every call site is a conjuction point in the dynamic circuit, because it connects to an instance of a function body The same call site with the same arguments is a unique joint point in the process graph, with a deterministic future If the same <call site, argument type> combination has appear before in the path, there must be a loop

Related Work Similar to control- flow analyses, but much simpler No need to build CFG (as in original CFAs) No need to maintain stack manually (as in CFA2) CFG here is dynamic and implicit (maybe impossible to build statically) Doesn t record any information on the AST Recursive style leads to full utilization of host language Much simpler than type inferencer of JSCompiler (Google s type inference and static checker for JavaScript) JSCompiler also needs type annotations, iirc Very similar to NCI (Near Concrete Interpretation) But using another way to detect recursion

Connections to Deeper Theories In essence, the analysis is doing a simple version of supercompilation Similar to technique used by automatic theorem provers such as ACL2 Does not track as much information (only type information is tracked) Termination technique is more efficient (no expensive homeomorphic embedding checks).. but may not be as accurate may cause false- negatives!

Limitations Doesn t process bytecode. Needs all source code to be available (except for built- ins which was hard- coded or mocked) Does not track value/range of numbers Does not track heap storage (assume side- effects on heap won t affect typing) May produce false- negatives at recursions Worst- case complexity is high More approximations can be used to improve efficiency (at the cost of reducing accuracy) Error reports are not user- friendly for deep bugs

Applicability A general way of type inference/static analysis Can be applied to any programming language More useful for dynamic languages because type annotations of static languages make it a lot easier and more modular There are always trade- offs though

Availability 2009 version Jython Indexer (in Java, open- source) modular analysis with unification (similar to HM system) can t resolve some names fast currently used by Google for building code index open- sourced to Jython 2010 version PySonar (in Java, not open- source) inter- procedural analysis can resolve most names can detect deeper semantic bugs slow 2011 version mini- pysonar (in Python, open- source) available from GitHub contains only the essential parts for illustrating the idea

Possible Future Work Apply the technique to other (hopefully simpler) languages Publish a paper about the general method Derive other ideas from the same intuition

Thank you!