Lessons Learned in Static Analysis Tool Evaluation
Providing World-Class Services for World-Class Competitiveness


Overview
Lessons learned in the evaluation of five commercially available static analysis tools.
Topics:
- Licensing
- Performance Measurement
- Limitations of Test Cases
- Flaw Reporting
- Tool Output
- Tool Tuning
- Tool Suites
- Summary

Licensing
- Licensing schemes vary among vendors and have very little in common.
- Terminology is ambiguous, or the same term is applied differently by vendors:
  - Lines of Code (LOC)
  - Codebase
- Impacts:
  - Return on investment depends on usage.
  - Secondary releases of code can be affected by a cumulative LOC count.
  - Restrictions on access to output data may be tied to the license scheme.
- Current license schemes are better suited to software developers than to software auditors (third-party evaluators).

Performance Measurement
- A consistent set of criteria is required for measuring tool performance:
  - A true positive (TP) is a flaw/weakness identified by a tool that is proven to be an actual flaw/weakness.
  - A false positive (FP) is a flaw/weakness identified by a tool that is proven not to be an actual flaw/weakness.
  - A false negative (FN) is a known flaw/weakness that the tool did not identify.
- Determining FN rates is problematic, as it requires thorough knowledge or annotation of the code.
- The lack of an accurate FN count makes determining the accuracy of the tools problematic.
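As an illustrative aside (not from the original briefing), the minimal C sketch below shows how these counts turn into the usual rates; the numbers are invented, and the recall figure is only as trustworthy as the FN count, which is exactly the difficulty noted above.

    #include <stdio.h>

    /* Hypothetical counts for one tool on one annotated code base. */
    struct eval_counts {
        int tp;  /* findings confirmed as real flaws       */
        int fp;  /* findings confirmed as not real flaws   */
        int fn;  /* known flaws the tool failed to report  */
    };

    int main(void) {
        struct eval_counts c = { .tp = 40, .fp = 60, .fn = 25 };  /* made-up numbers */

        double precision = (double)c.tp / (c.tp + c.fp);  /* share of reports that are real      */
        double recall    = (double)c.tp / (c.tp + c.fn);  /* share of known flaws found; only as
                                                             good as the FN count                */
        printf("precision = %.2f, recall = %.2f\n", precision, recall);
        return 0;
    }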

Limitations of Test Cases
- Results are very sensitive to test case selection.
- Tools perform differently with different languages; use of a multi-language data set is desirable.
- Flaws are grouped at a high level (flaw type):
  - Many distinct flaws are categorized as buffer overflow.
  - Need to cover the detail of each flaw type.
  - Complicating factors include read vs. write, scope, indirection, and control flow (see the sketch below).
- Natural vs. seeded flaws.
- Scalability is difficult to test.
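The sketch below (illustrative only, not taken from the evaluation's test cases) shows why a single "buffer overflow" label hides several distinct cases that a tool may handle very differently: a direct out-of-bounds write, an out-of-bounds read through a pointer, and an overflow that occurs only on one control-flow path.

    #include <string.h>

    /* Seeded flaws: each variant below is a deliberate defect. */
    void buffer_overflow_variants(const char *user_input)
    {
        char buf[16];

        buf[16] = 'x';                 /* FLAW: out-of-bounds WRITE, direct index  */

        char *p = buf;
        char c = p[32];                /* FLAW: out-of-bounds READ via indirection */
        (void)c;

        if (user_input != NULL)
            strcpy(buf, user_input);   /* FLAW: overflow only on this path, and only
                                          when strlen(user_input) >= 16            */
    }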

Limitations of Test Cases (2)
Natural code:
- Exhibits complicated paths
- Employs real data structures
- Uses advanced techniques
- Can stress scalability
- Exhibits idioms
- Unknown design intent
- Unknown number of real flaws
- Hard to automate analysis
Seeded code:
- Flaw content (mostly) known
- Flaw independence
- Result equivalence easier to determine
- Supports automatic generation/testing
- Very limited in complexity
- Flaw frequency distribution does not mirror the real world

Limitations of Test Cases (3)
- Many programs generate code during the build.
- Code is not always generated in an easy-to-follow style, which complicates review.
- The generated code base is usually very robust and may contain dead or useless code.
- Some tools will not analyze generated code.
- This causes problems:
  - One TP or FP can appear as many.
  - Unusual code structure can cause a large number of FPs.

Flaw Reporting
- Standard formats, but with incomplete data.
- Level of detail provided varies: some tools provide flaw source, line number, trace-back links, examples, etc.; others are limited.
- Trend analysis varied from non-existent to excellent.
- Different levels of checker granularity.
- No application programming interface (API) to audit results or access the results database.

Flaw Reporting (2)

    184 fprintf(f, uri);
    185 fprintf(f, "\r\n");
    186 for (i = 0; i < nh; ++i)
    187     fprintf(f, h[i]);

Line 184 was correctly identified as containing a Format String vulnerability: the variable uri in this program is user controllable and is passed unchecked to the format parameter of fprintf. Line 187 was identified as a Tainted Data/Unvalidated User Input vulnerability. It is correct that the data contained in h[i] is unchecked user-supplied data; however, like line 184, it is also passed to the format parameter of fprintf, so this issue is more a Format String vulnerability than a Tainted Data issue.
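For context (an illustrative sketch, not part of the original slide; the wrapper function name and parameters below are hypothetical), the usual remediation for both findings is the same: user-controlled data goes in as an argument, never as the format string itself.

    #include <stdio.h>

    /* Sketch of the usual fix for both findings above. */
    static void write_headers(FILE *f, const char *uri, char *h[], int nh)
    {
        fprintf(f, "%s", uri);      /* user data supplied as an argument, not a format */
        fprintf(f, "\r\n");
        for (int i = 0; i < nh; ++i)
            fprintf(f, "%s", h[i]);
    }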

Tool Output
- Tools vary in the types of flaws they detect.
- Certain checkers tended to produce a large number of FPs (noise generators):
  - These need to be identified.
  - Difficult without prior knowledge.
- Manual analysis:
  - Manually inspect each flaw to determine validity.
  - Open to human interpretation; an experienced human auditor should always make a better judgment than a tool.
  - Time consuming.
- So how do you make a somewhat objective comparison of tool performance?

Tool Output (2)
- Objective: identify 25 flaw types that were detected by all 5 tools.
  - Only 3 flaw types were detected by all 5 tools.
  - Only 10 flaw types were detected by 4 out of 5 tools: Improper Return Value Used, Buffer Overflow, Command Injection, Format String, Memory Leak, Null Pointer Dereference, Tainted Data, Time of Check to Time of Use (TOCTOU), Use After Free, Uninitialized Value.
  - This set ended up being of interest to the evaluation because of coverage, not because of the flaw types themselves.
  - There were some additional flaws covered by the tools, but the code base did not provide TP hits.
- Evaluate the tools at their default and ideal settings.
- Reduce the noisy checkers.

Tool Tuning
- Not all tools are tunable; some allow tuning in the analysis and/or post-analysis phase.
- Analysis Phase:
  - Enable/disable analysis engines and algorithms.
  - Set thresholds and filtering of results.
  - Create custom models or rules to help the analysis.
- Post-Analysis Phase (Review and Audit):
  - Filtering, categorization, prioritization.
  - More manual analysis.
  - Potential loss of raw results.

Tool Tuning - Virtual
- Initially requires sample code bases of known flaws:
  - The sample code base consists of analyzed natural and/or seeded code.
  - It can be further refined as your knowledge base of code grows.
- Tuning is based upon the flaw checkers:
  - Turn on ALL flaw-checking options in the analysis phase and run the tool against your code samples.
  - For each reported issue from the tool, extract source file, line number, flaw type, and checker, and mark it as a TP or FP.
  - Sort by flaw type, then artificially turn off each checker and plot how this affects the TP/FP rate (see the sketch below).
- Shows the best TP/FP tuning by flaw type.
- Allows for comparisons between tools.
- No constraints on the tool during the analysis phase.
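A minimal sketch of the bookkeeping this implies (all record fields, checker names, and counts below are invented for illustration): tally TPs and FPs for a flaw type, then recompute the tally with a given checker "virtually" disabled to see how the rates move.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical record of one audited tool finding. */
    struct finding {
        const char *file;
        int         line;
        const char *flaw_type;   /* e.g. "Buffer Overflow"    */
        const char *checker;     /* which checker reported it */
        int         is_tp;       /* 1 = confirmed TP, 0 = FP  */
    };

    /* Count TPs/FPs for one flaw type with a single checker virtually disabled. */
    static void rate_without(const struct finding *f, int n,
                             const char *flaw_type, const char *disabled)
    {
        int tp = 0, fp = 0;
        for (int i = 0; i < n; ++i) {
            if (strcmp(f[i].flaw_type, flaw_type) != 0) continue;
            if (disabled && strcmp(f[i].checker, disabled) == 0) continue;
            if (f[i].is_tp) ++tp; else ++fp;
        }
        printf("%-16s disabled=%-8s TP=%d FP=%d\n",
               flaw_type, disabled ? disabled : "(none)", tp, fp);
    }

    int main(void) {
        /* Made-up audited results for one tool. */
        struct finding findings[] = {
            { "a.c", 10, "Buffer Overflow", "BO_1", 1 },
            { "a.c", 42, "Buffer Overflow", "BO_2", 0 },
            { "b.c", 7,  "Buffer Overflow", "BO_2", 0 },
            { "b.c", 99, "Buffer Overflow", "BO_3", 1 },
        };
        int n = (int)(sizeof findings / sizeof findings[0]);

        rate_without(findings, n, "Buffer Overflow", NULL);    /* baseline          */
        rate_without(findings, n, "Buffer Overflow", "BO_2");  /* noisy checker off */
        return 0;
    }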

Tool Tuning - Example
[Chart omitted: each point corresponds to a checker.]
- This tool has a lot of buffer overflow checkers, most of which generate FPs.
- Tune the tool at the "knee" of the curve: a small reduction in TP rate but a big reduction in FP rate.
- Ideally, you would want at least three checkers to fire.

Tool Suites
- Coverage by one tool is not sufficient.
- It is hard to compare tool performance based on coverage across the 10 common flaws as a whole; you need to look at how tools perform on a flaw-by-flaw basis.
- Trade-off between aggressive and conservative tools:
  - Aggressive tools tend to report more TPs but also more FPs.
  - Conservative tools report fewer FPs but increase the FN rate.
- A suite of tools will provide better coverage (see the sketch below); maximizing individual flaw coverage requires 2 or 3 tools.
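A sketch of the arithmetic behind that claim, using invented data: a known flaw counts as covered if any tool in the suite reports it, while every tool's false positives still have to be reviewed.

    #include <stdio.h>

    #define N_TOOLS 3
    #define N_FLAWS 6   /* known real flaws of one type in the sample code */

    int main(void) {
        /* Hypothetical data: found[t][k] == 1 if tool t reports known flaw k. */
        int found[N_TOOLS][N_FLAWS] = {
            { 1, 1, 0, 0, 1, 0 },   /* Tool A (aggressive)   */
            { 1, 0, 1, 0, 0, 0 },   /* Tool B                */
            { 0, 0, 0, 1, 0, 0 },   /* Tool C (conservative) */
        };
        int fps[N_TOOLS] = { 30, 12, 3 };   /* made-up FP counts per tool */

        int covered = 0, total_fps = 0;
        for (int k = 0; k < N_FLAWS; ++k) {
            for (int t = 0; t < N_TOOLS; ++t) {
                if (found[t][k]) { ++covered; break; }  /* any tool finding it is enough */
            }
        }
        for (int t = 0; t < N_TOOLS; ++t)
            total_fps += fps[t];   /* every spurious report still has to be triaged */

        printf("suite covers %d of %d known flaws, at the cost of %d FPs to review\n",
               covered, N_FLAWS, total_fps);
        return 0;
    }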

Tool Suites (2)
[Chart: contribution of each tool (A-E) to total reported TPs, broken out by flaw type: Improper Return Value Use, Format String, Memory Leak, TOCTOU, Buffer Overflow, Command Injection, Uninitialized Variable Use, Null Pointer Dereference, Tainted Data, Use After Free.]

Tool Suites (3)
[Chart: contribution of each tool (A-E) to total reported FPs, broken out by the same ten flaw types as above.]

Tool Suites (4)
[Chart: Buffer Overflow tool suite results, plotting % of TPs against 100% - % of FPs for Tools A-E individually, the Tool A+B pair, and the full Tool A+B+C+D+E suite.]

Summary
- Tool overlap is not necessary in a tool suite, but overlap can increase your confidence.
- Performance measurement is relative:
  - Sensitive to test case selection.
  - No standard.
- Lack of automation: analysis is heavily dependent on human verification.
- Virtual tuning enables you to identify individual tools' strengths.
- Disambiguation, i.e., determining whether two flaws are the same, is a complex problem.