Decompilation of Binary Programs & Structuring Decompiled Graphs

Ximing Yu
May 3, 2011

1 Introduction

A decompiler, or reverse compiler, is a program that attempts to perform the inverse of the compilation process: given an executable program that was compiled from some high-level language, the aim is to produce a high-level language program that performs the same function as the executable. The input is therefore machine dependent, while the output is language dependent.

1.1 Problems

Several practical problems are faced when writing a decompiler:

- The representation of data and instructions in the von Neumann architecture: they are indistinguishable (a byte-level illustration appears at the end of this section).
- The great number of subroutines introduced by the compiler and the linker, such as start-up subroutines that set up the program's environment.

1.2 Uses of a Decompiler

There are several uses for a decompiler, covering two major software areas: maintenance of code and software security.

Maintenance:
- Recovery of lost source code
- Migration of applications to a new hardware platform
- Translation of code written in an obsolete language into a newer language
- Structuring of old code written in an unstructured way
- A debugging tool that helps in finding and correcting bugs in an existing binary program

Software security: a binary program can be checked for the existence of malicious code (e.g. viruses) before it is run for the first time on a computer.
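The first of these problems is easy to see on concrete bytes. The sketch below is not from the paper; the byte values are standard x86 encodings (0x90 is nop, 0xc3 is ret), and the point is only that nothing in memory marks them as code or data.

```python
# Illustration only: the same bytes read as data and as code.
blob = bytes([0x90, 0x90, 0xc3])

# As data: a 3-byte buffer, or the little-endian integer it encodes.
print("as data:", blob.hex(), "=", int.from_bytes(blob, "little"))

# As code: these very bytes disassemble to "nop; nop; ret".  A decompiler
# has to decide which view applies by following the flow of control from
# known entry points rather than by inspecting the bytes themselves.
print("as code: nop; nop; ret")
```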

2 The Decompiler Structure

There are three main modules:

- Front-end: a machine-dependent module that reads in the program, loads it into virtual memory, and parses it.
- Universal Decompiling Machine: a machine- and language-independent module that analyses the program in memory.
- Back-end: a language-dependent module that writes formatted output for the target language.

2.1 The front-end

The front-end module deals with machine-dependent features and produces a machine-independent representation.

Input: a binary program for a specific machine.
Output: an intermediate representation of the program and the program's control flow graph.

The front-end module consists of:

- Loader: loads the binary program into virtual memory.
- Parser: disassembles code starting at the entry point given by the loader. It follows the instructions sequentially until a change in the flow of control is met; all instruction paths are followed in a recursive manner (a sketch of this traversal is given below). The intermediate code is generated and the control flow graph is built.
- Semantic analysis: performs idiom analysis and type propagation.

Two levels of intermediate code are required:

- A low-level representation that resembles the machine's assembly language: a mapping of machine instructions to assembler mnemonics.
- A higher-level representation that resembles statements of a high-level language, generated by the interprocedural data flow analysis.

The front-end generates the low-level representation.

2.2 The universal decompiling machine

The universal decompiling machine (UDM) is an intermediate module that is completely machine and language independent. It deals with flow graphs and the intermediate representation of the program, and performs all the flow analysis the input program needs.
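As a rough illustration of the parser's traversal (this is not dcc's code; decode() and the shape of its return value are assumptions made only to keep the sketch self-contained), disassembly can be expressed as a worklist that follows every path introduced by a change in the flow of control:

```python
# Minimal sketch of recursive-traversal disassembly, assuming a
# hypothetical decode(addr) helper that returns
# (mnemonic, branch_targets, falls_through, size) for the instruction
# at `addr`.
def parse(entry_point, decode):
    visited = set()            # addresses already disassembled
    worklist = [entry_point]   # paths still to be followed

    while worklist:
        addr = worklist.pop()
        # Follow instructions sequentially until a change in flow of
        # control (jump, call, return) ends the run.
        while addr is not None and addr not in visited:
            visited.add(addr)
            mnemonic, targets, falls_through, size = decode(addr)
            # Every branch or call target starts a new path; the paper's
            # recursion is expressed here with an explicit worklist.
            worklist.extend(t for t in targets if t not in visited)
            addr = addr + size if falls_through else None
    return visited
```

In a real front-end the intermediate code and the control flow graph would be built inside this loop; the sketch only records which addresses were reached.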

2.2.1 Data flow analysis

The aims of this phase are to:

- Transform the low-level intermediate representation into a higher-level representation that resembles HLL statements.
- Eliminate the concepts of condition codes (or flags) and registers, as these concepts do not exist in high-level languages.

Condition codes

Condition codes are classified into two groups: HLCC, the set of condition codes that are likely to have been generated by a compiler (e.g. overflow, carry), and NHLCC, the set of condition codes that are likely to have been generated by assembly code (e.g. trap, interrupt). A use/definition (reaching definition) analysis is performed on the condition codes. Once the set of defining and using instructions is known, these low-level instructions can be replaced by a single conditional high-level instruction that is semantically equivalent to them (a sketch appears at the end of this subsection).

Registers

Low-level instructions are classified into two sets: HLI, the set of instructions that are representable in a high-level language (e.g. mov, add), and NHLI, the set of instructions that are likely to be generated only by assembly code (e.g. cli, ins). Eliminating temporary registers also eliminates intermediate instructions, by replacing several low-level instructions with one high-level instruction that does not make use of the intermediate register. Two preliminary analyses are required: a definition/use analysis on registers, and an interprocedural live register analysis. The method used to eliminate registers is forward substitution (also sketched at the end of this subsection).

Data flow analysis also introduces the concepts of expressions and parameter passing, as these can be used in any HLL program. The high-level instructions generated by this analysis are:

- asgn (assignment)
- jcond (conditional jump)
- jmp (unconditional jump)
- call (subroutine call)
- ret (subroutine return)

No control-structure instructions are recovered in this phase.
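A much-simplified sketch of the condition-code step follows. It is not the paper's algorithm (which relies on the reaching-definition information rather than on textual adjacency); it only shows the effect of folding a flag-defining compare and the jump that uses its flags into one semantically equivalent jcond.

```python
# Sketch only: merge a flag-setting `cmp` with the conditional jump that
# uses its flags into a single high-level jcond instruction.  The tuple
# instruction format is an assumption made for this illustration.
CONDITIONS = {"jg": ">", "jl": "<", "je": "==", "jne": "!="}

def eliminate_condition_codes(instrs):
    out, i = [], 0
    while i < len(instrs):
        ins = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        # Pattern: `cmp` defines the flags, the next instruction uses them.
        if ins[0] == "cmp" and nxt is not None and nxt[0] in CONDITIONS:
            _, lhs, rhs = ins
            out.append(("jcond", f"{lhs} {CONDITIONS[nxt[0]]} {rhs}", nxt[1]))
            i += 2
        else:
            out.append(ins)
            i += 1
    return out

print(eliminate_condition_codes([("cmp", "ax", "bx"), ("jg", "L1")]))
# -> [('jcond', 'ax > bx', 'L1')]
```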

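Forward substitution can be illustrated with the toy sketch below. The (destination, expression) instruction format and the naive string replacement are assumptions made for brevity; a real implementation substitutes into expression trees guided by the definition/use and liveness information mentioned above.

```python
# Sketch only: forward substitution of temporary registers.  Each
# instruction is a (destination, expression) pair of strings; the string
# replacement is deliberately naive and would not survive operand names
# that overlap (e.g. "ax" inside "max").
def forward_substitute(instrs, temporaries):
    defs, out = {}, []
    for dest, expr in instrs:
        # Substitute any previously defined temporary used in this expression.
        for reg, value in defs.items():
            expr = expr.replace(reg, value)
        if dest in temporaries:
            defs[dest] = expr          # remember the definition, emit nothing
        else:
            out.append((dest, expr))   # a real variable: keep the statement
    return out

# ax = loc1; ax = ax + 5; loc2 = ax   becomes   loc2 = loc1 + 5
print(forward_substitute(
    [("ax", "loc1"), ("ax", "ax + 5"), ("loc2", "ax")],
    temporaries={"ax"}))
# -> [('loc2', 'loc1 + 5')]
```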
2.2.2 Control flow analysis

The control flow analyzer structures the control flow graph into generic high-level control structures that are available in most languages, including:

Loops:
- pre-test loop: while()
- post-test loop: repeat...until()
- infinite loop: loop

Conditionals:
- 2-way conditionals: if...then and if...then...else
- n-way conditionals: case

Three types of nodes describe the subgraphs that represent high-level loops and 2-way structures:

- Header node: the entry node of a structure.
- Follow node: the first node that is executed after a (possibly nested) structure has finished.
- Latching node: the last node in a loop; the one whose immediate successor is the header of the loop.

Structuring Loops

In interval theory, an interval I(h) is the maximal single-entry subgraph in which h is the only entry node and in which all closed paths contain h. The unique interval node h is called the header node. By selecting the proper set of header nodes, a graph G can be partitioned into a unique set of disjoint intervals I = {I(h1), I(h2), ..., I(hn)}.

The derived sequence of graphs G1, ..., Gn is based on the intervals of graph G. The first-order graph, G1, is G itself. The second-order graph, G2, is derived from G1 by collapsing each interval of G1 into a single node, and so on.

Given an interval I(hj) with header hj, there is a loop rooted at hj if there is a back-edge to the header node hj from a latching node nk that belongs to I(hj). Once a loop has been found, its type is determined by the types of the header and latching nodes of the loop:

- A while() loop is characterized by a 2-way header node and a 1-way latching node.
- A repeat...until() loop is characterized by a 2-way latching node and a non-conditional header node.
- An endless loop is characterized by a 1-way latching node and a non-conditional header node.

The algorithm is as follows (sketches of the detection and classification steps follow this list):

1. Each interval header in G1 is checked for a back-edge from a latching node that belongs to the same interval.
2. If such an edge exists, a loop has been found; its type is determined and the nodes that belong to it are marked.
3. Next, the intervals of G2 (i.e. I2) are checked for loops, and the process is repeated until the intervals of In have been checked.
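A minimal sketch of the detection step (step 1) is given below, assuming the graph is a dictionary mapping each node id to its list of successors; this is an illustration, not the paper's implementation.

```python
# Sketch only: a loop rooted at an interval header exists when some node
# of the same interval has a back-edge to that header.
def find_latching_node(header, interval_nodes, successors):
    """Return a latching node of the loop rooted at `header`, or None."""
    for n in interval_nodes:
        if header in successors.get(n, ()):   # back-edge n -> header
            return n
    return None

# Example: interval {1, 2, 3} with header 1 and back-edge 3 -> 1.
successors = {1: [2], 2: [3], 3: [1, 4], 4: []}
print(find_latching_node(1, {1, 2, 3}, successors))   # -> 3
```

Marking the nodes that belong to the loop (step 2) is omitted from this sketch.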

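Once the header and latching nodes are known, the loop type follows directly from the three characterizations above. The sketch below assumes each node's number of successors is available in an out_degree mapping; it is an illustration, not the paper's code.

```python
# Sketch only: classify a detected loop from the out-degrees of its
# header and latching nodes (2 successors = "2-way", 1 = "1-way").
def classify_loop(header, latching, out_degree):
    header_2way = out_degree[header] == 2
    latching_2way = out_degree[latching] == 2
    if header_2way and not latching_2way:
        return "while()"            # pre-test loop
    if latching_2way and not header_2way:
        return "repeat...until()"   # post-test loop
    if not header_2way and not latching_2way:
        return "endless loop"
    # Both nodes conditional: not resolved by this simplified sketch.
    return "unresolved"

print(classify_loop(header=1, latching=3, out_degree={1: 1, 2: 1, 3: 2}))
# -> repeat...until()
```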
Structuring 2-Way Conditionals

Both a single-branch conditional (i.e. if...then) and an if...then...else conditional subgraph have a common follow node, which has the property of being immediately dominated by the 2-way header node. When these subgraphs are nested, they can have different follow nodes or share the same common follow node. During loop structuring, a 2-way node that is either the header or the latching node of a loop is marked as being part of the loop, and must therefore not be processed during 2-way conditional structuring.

Compound Conditions

Whenever a subgraph of the form of a short-circuit evaluated graph is found, it is checked for the following properties:

1. Nodes x and y are 2-way nodes.
2. Node y has only one in-edge.
3. Node y has a unique instruction: a conditional jump (jcond) high-level instruction.
4. Nodes x and y must branch to a common t or e node.

2.3 The back-end

The back-end module is language dependent, as it deals with the target high-level language. It has the following phases:

- Restructuring (optional): structures the graph even further, so that control structures that are available in the target language but not present in the generic set used by the structuring algorithm described above are utilized.
- HLL code generation: generates code for the target HLL based on the control flow graph and the associated high-level intermediate code. It involves:
  - Defining global variables.
  - Emitting code for each procedure/function following a depth-first traversal of the call graph of the program.
  - If a goto instruction is required, creating a unique label identifier and placing it before the instruction that the goto targets.
  - Giving variables and procedures names of the form loc1, proc2.

3 The Decompiling System

The decompiling system integrates a decompiler, dcc, and an automatic signature generator, dccsign.

3.0.1 Signature generator

A signature generator is a front-end module that generates signatures for compilers and for the library functions of those compilers. Such signatures are stored in a database and accessed by dcc to check whether a subroutine is a library function; if it is, the function is not analyzed by dcc but replaced by its library name (e.g. printf()).
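One possible shape for that lookup is sketched below. Hashing a fixed-length prefix of the subroutine and the table layout are assumptions made for illustration; they are not dccsign's actual signature format.

```python
# Sketch only: library-function recognition by signature lookup.
import hashlib

PREFIX_LEN = 16   # arbitrary prefix length chosen for this sketch

def make_signature(code: bytes) -> str:
    return hashlib.md5(code[:PREFIX_LEN]).hexdigest()

def name_subroutine(code: bytes, signature_db: dict) -> str:
    """Return the library name if the signature matches, else an empty string."""
    return signature_db.get(make_signature(code), "")

# Build a toy database from (made-up) library subroutine bodies, then query it.
library = {"printf": bytes([0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x08])}
db = {make_signature(body): name for name, body in library.items()}
print(name_subroutine(bytes([0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x08]), db))
# -> printf
```

A subroutine whose signature is found in the database is then emitted as a call to that library name instead of being decompiled.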