Metamorphic Code Generation from LLVM IR Bytecode


1 San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2013 Metamorphic Code Generation from LLVM IR Bytecode Teja Tamboli San Jose State University Follow this and additional works at: Recommended Citation Tamboli, Teja, "Metamorphic Code Generation from LLVM IR Bytecode" (2013). Master's Projects This Master's Project is brought to you for free and open access by the Master's Theses and Graduate Research at SJSU ScholarWorks. It has been accepted for inclusion in Master's Projects by an authorized administrator of SJSU ScholarWorks. For more information, please contact

Metamorphic Code Generation from LLVM IR Bytecode

A Project Presented to The Faculty of the Department of Computer Science, San Jose State University

In Partial Fulfillment of the Requirements for the Degree Master of Science

by Teja Tamboli

May 2013

© 2013 Teja Tamboli
ALL RIGHTS RESERVED

The Designated Project Committee Approves the Project Titled Metamorphic Code Generation from LLVM IR Bytecode by Teja Tamboli

APPROVED FOR THE DEPARTMENTS OF COMPUTER SCIENCE, SAN JOSE STATE UNIVERSITY, May 2013

Dr. Mark Stamp, Department of Computer Science
Dr. Robert Chun, Department of Computer Science
Dr. Sami Khuri, Department of Computer Science

ABSTRACT

Metamorphic Code Generation from LLVM IR Bytecode

by Teja Tamboli

Metamorphic software changes its internal structure across generations while its functionality remains unchanged. Metamorphism has been employed by malware writers as a means of evading signature detection and other advanced detection strategies. However, code morphing also has potential security benefits, since it increases the genetic diversity of software. In this research, we have created a metamorphic code generator within the LLVM compiler framework. LLVM is a three-phase compiler that supports multiple source languages and target architectures. It uses a common intermediate representation (IR) bytecode in its optimizer. Consequently, any supported high-level programming language can be transformed to this IR bytecode as part of the LLVM compilation process. Our metamorphic generator functions at the IR bytecode level, which provides many advantages over previously developed metamorphic generators. The morphing techniques that we employ include dead code insertion (where the dead code is actually executed within the morphed code) and subroutine permutation. We have tested the effectiveness of our code morphing using hidden Markov model analysis.

ACKNOWLEDGMENTS

Firstly, I would like to thank Dr. Mark Stamp, my project advisor, for his guidance and support throughout my Master's degree and project. Secondly, I would like to thank my committee members, Dr. Sami Khuri and Dr. Robert Chun, for providing suggestions without which this project would not have been possible. Finally, I would like to thank my family and my husband Onkar Deshpande for their encouragement, motivation, and support.

TABLE OF CONTENTS

CHAPTER 1 Introduction

CHAPTER 2 Malware
    Malware Types
    Virus
    Worms
    Detection Techniques
    Signature Based Detection
    Anomaly Based Detection
    Hidden Markov Model Based Detection

CHAPTER 3 Metamorphic Techniques
    Register Swap
    Subroutine Permutation
    Dead Code Insertion
    Instruction Substitution
    Transposition
    Formal Grammar Mutation

CHAPTER 4 LLVM
    Introduction
    Architecture of LLVM
    Traditional Three Phase Compiler Design
    LLVM Design
    LLVM IR Bytecode
    Tools in LLVM
    llvm-gcc
    llvm-as
    lli
    Opt
    llvm-dis
    llc
    llvm-link

CHAPTER 5 Hidden Markov Models and Virus Detection
    HMM
    Example
    The Three Problems
    Forward Algorithm
    Backward Algorithm
    Baum-Welch Algorithm
    HMM and Virus Detection
    Log Likelihood Per Opcode
    Effectiveness of HMM Detection
    Evading HMM Detection

CHAPTER 6 Design and Implementation
    Introduction
    Challenge and Innovation
    Goals
    Metamorphic Techniques Used
    Dead Code Insertion
    Function Permutation
    Implementation
    Dead Code Insertion
    Call Dead Functions
    Function Permutation
    Software Used

CHAPTER 7 Experiments
    Base File
    HMM Detection

CHAPTER 8 Conclusion

APPENDIX
    Additional HMM results

LIST OF TABLES

1    HMM Notations [43]
2    Probabilities of observing (S, M, S, L) for all possible state sequences
3    ROC AUC statistics for rate of dead code insertion
4    LLVM optimizer passes used to remove dead code
A.1  ROC AUC statistics for rate of dead code insertion

LIST OF FIGURES

1    Multiple shapes of a metamorphic virus body [21]
2    Two different generations of RegSwap [10]
3    Subroutine permutation [10]
4    Zperm virus [10]
5    A simple polymorphic decryptor and two variants [54]
6    Formal grammar for decryptor mutation [54]
7    Three major components of a three-phase compiler
8    LLVM design [27]
9    LLVM bytecode file format [37]
10   Sample C code and its corresponding IR bytecode
11   Program's life cycle in LLVM compiler
12   Generic Hidden Markov Model [43]
13   HMM Model [43]
14   Extracted opcode sequence
15   HMM Model
16   Metamorphic code generator architecture diagram
17   HMM results for base virus and benign files
18   HMM results with 10% and 20% dead code insertion
19   HMM results with 30% and 50% dead code insertion
20   HMM results with rate of dead code insertion
21   ROC curves for rate of dead code insertion
22   HMM scores after using optimizers
23   HMM scores after using link-time optimizer
A.1  HMM results for base virus and benign files
A.2  HMM results with 10% and 20% dead code insertion
A.3  HMM results with 30% and 50% of dead code insertion
A.4  HMM results with rate of dead code insertion

CHAPTER 1

Introduction

Software is said to be metamorphic if multiple copies of the same software are structurally different but functionally equivalent. Examples of metamorphic code generators can be found in [3, 22, 42]. To date, metamorphic code generation has primarily been used by malware writers, since well-designed metamorphic code can evade signature-based detection and other more advanced detection strategies [22, 42, 53]. However, metamorphism also has the potential to provide security benefits by increasing the genetic diversity of software, thereby making several types of attacks more difficult [16, 45].

Well-designed metamorphic malware will evade signature-based detection. According to recent research [42, 53], techniques based on hidden Markov models (HMMs) can be used to detect some types of metamorphic malware. Many metamorphic malware generators are readily available [51]. Some notable examples include:

- G2 (Second Generation virus generator) [52]
- MPCGEN (Mass Code Generator) [51]
- NGVCK (Next Generation Virus Creation Kit) [50]
- VCL32 (Virus Creation Lab for Win32) [5]

In addition, research morphing engines are presented in [22] and [42]. All of these metamorphic generators work at the assembly language level. Note that code morphing of high-level source code is far simpler, but generally ineffective, since such morphing does not provide sufficient control over the resulting executable file.

In this research, we have implemented a metamorphic code generator using the LLVM compiler framework [2]. The LLVM compiler infrastructure provides for compile-time, link-time, run-time, and idle-time optimization of programs written in arbitrary programming languages. It is a three-phase compiler that supports multiple source languages and multiple target architectures. In the optimization process, code is converted to an intermediate representation (IR) bytecode. Our code morphing tool functions at this IR bytecode level, which provides advantages that are somewhat analogous to morphing at the source-code level, but also provides the necessary fine-grained control over the resulting executable.

Related research involving LLVM IR bytecode manipulation includes a malware encryption technique implemented as optimizer passes [17]. In addition, in [35], a shadow attack is developed using LLVM. This attack hides system call behavior for the purpose of making behavior-based detection of malware more difficult.

We evaluate our morphing technique using the hidden Markov model (HMM) analysis developed in [53], which has been further developed and analyzed in [42, 43]. This body of previous work provides a baseline for determining the effectiveness of our approach.

This paper is organized as follows. In Chapter 2, we provide background information on computer viruses and detection techniques. Chapter 3 describes generic metamorphic techniques used by metamorphic code generators. Chapter 4 explains the architecture of the LLVM compiler infrastructure and elaborates on IR bytecode. Chapter 5 details the design and implementation of the HMM detector used to evaluate our results. Chapter 6 covers the design and implementation of our metamorphic code generator. Experimental results are analyzed in Chapter 7. Chapter 8 contains our conclusion and suggestions for future work.

CHAPTER 2

Malware

Malware is a program developed to perform malicious activities on a computer [36]. These malicious activities include crashing disks or the operating system and altering the system's data [35]. Writing malware is a challenging task [3] and could be a source of revenue for malware writers. Anti-virus software can detect these malicious programs and remove them from the computer.

To date, most development and research into metamorphic code has involved malware. Therefore, we present background information on malware before turning our attention to the general case of metamorphic code generation.

2.1 Malware Types

The two most prominent types of malware are viruses and worms. They are discussed in detail in the following subsections.

2.1.1 Virus

Virus-like programs first appeared on microcomputers in the 1980s [8]. A virus can replicate itself, but it cannot spread on its own: it needs human interaction to carry the infection from one computer to another. For example, a virus can be downloaded from the Internet or passed along by exchanging infected USB drives or floppy disks.

Virus writers constantly develop new obfuscation techniques to evade signature-based detection. The most important methods of evading detection are encryption, polymorphism and, more recently, metamorphism [21]. These techniques are explained in the following subsections.

Encrypted Viruses

The simplest method of hiding the virus body is to encrypt it with different encryption keys. Most of the virus executable is encrypted, and a small decryption module exists to decrypt the encrypted body. For example, the virus body may be XORed with the key [6]. Since the virus is encrypted with a different key for each infected file, only the decryption module remains constant across generations. The encrypted body has no common signature, but virus scanners can detect the decryption module. As a result, the code that encrypts or decrypts is part of the signature in most such virus definitions [13].

Polymorphic Viruses

Polymorphic viruses solve the problem of the unencrypted decryptor module being used for detection. The decryptor module is mutated in each generation. In addition, polymorphic viruses can generate a large number of unique decryptors by using different encryption methods. Therefore, no two infections have the same signature [22, 42]. A code emulator is used to detect polymorphic viruses. The emulator emulates the decryption process and dynamically decrypts the encrypted virus body [13].

Metamorphic Viruses

To avoid detection by emulation, virus writers came up with an advanced technique for writing malware called metamorphism. Metamorphism changes the appearance of the virus while keeping its functionality. Metamorphic viruses do not need encryption or decryption techniques. They produce a new virus body on each infection [13].

The metamorphic engine can be kept separate or it can be embedded in the virus. In our implementation, we have kept the code generator separate. Some metamorphic techniques are explained in Chapter 3. Figure 1 shows generations of a metamorphic virus whose shape changes while its functionality remains constant.

Figure 1: Multiple shapes of a metamorphic virus body [21]

2.1.2 Worms

Worms are self-replicating malware. Unlike viruses, they do not need any human intervention to spread. They are standalone programs [36, 46]. Worms use techniques such as metamorphism and polymorphism to avoid detection, just as viruses do. A worm could be a macro residing in a Word document or an Excel sheet that spreads itself across the network. Worms spread from host to host across the network, unlike viruses, which only infect executables residing on the host [36, 42].

2.2 Detection Techniques

As malware evolves, detection mechanisms also evolve. This section elaborates on the detection mechanisms employed by anti-virus software.

2.2.1 Signature Based Detection

Signature-based detection is the simplest and most commonly used technique in anti-virus software. It is popular because of its accuracy, simplicity and speed [44]. In this detection mechanism, a scanner examines each executable and looks for specific signatures. A signature consists of a sequence of opcodes that uniquely identifies the virus. Anti-virus software keeps a database of signatures for different viruses and detects a virus by comparing against these signatures. The database has to be updated continually with new malware signatures. Another downside of this type of detection is that it cannot detect a new virus for which no signature exists yet. As a result, malware can easily evade signature-based detection by using simple code obfuscation techniques [38].

2.2.2 Anomaly Based Detection

Anomaly-based detection overcomes the problem of detecting new malware that affects signature-based detection. Heuristic methods are implemented to detect anomalous behavior. There are two main phases: training and detection. In training, the scanner learns normal and malicious behavior, for example, an attempt to find the root password. After training, it is used to detect malicious activities [44]. This technique can detect new viruses, but it has a downside too [18]. Because detection depends on how the system was trained, this approach tends to produce more false positives and false negatives.
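To make the signature scanning of Section 2.2.1 concrete, the following is a minimal Python sketch, assuming that a signature is simply a contiguous opcode subsequence. The names `find_signature`, `scan` and `SIGNATURES`, as well as the opcode values, are hypothetical and chosen only for illustration; real scanners typically match byte patterns and use far more efficient search structures.

```python
from typing import Dict, List, Sequence


def find_signature(opcodes: Sequence[str], signature: Sequence[str]) -> int:
    """Return the index of the first occurrence of signature in opcodes, or -1."""
    n, m = len(opcodes), len(signature)
    if m == 0:
        return -1
    for start in range(n - m + 1):
        if list(opcodes[start:start + m]) == list(signature):
            return start
    return -1


def scan(opcodes: Sequence[str], database: Dict[str, List[str]]) -> List[str]:
    """Return the names of all database entries whose signature occurs in opcodes."""
    return [name for name, sig in database.items()
            if find_signature(opcodes, sig) != -1]


# Hypothetical signature database with a single made-up entry.
SIGNATURES = {"demo_family": ["push", "mov", "xor", "call"]}

original = ["mov", "push", "mov", "xor", "call", "ret"]
# The same code with one extra instruction inserted inside the signature.
morphed = ["mov", "push", "mov", "nop", "xor", "call", "ret"]

print(scan(original, SIGNATURES))
print(scan(morphed, SIGNATURES))
```

In this toy setting the original sequence matches the signature, while the variant with a single inserted instruction does not, which is the weakness of signature scanning noted above.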

2.2.3 Hidden Markov Model Based Detection

Hidden Markov Model (HMM) based virus detection is a comparatively new technique. HMMs are probabilistic models that are widely used in pattern recognition problems. They model the probability of transition from one state to another. HMM-based detection has two phases: training and detection. Once the model is trained, it can be used to distinguish malware from benign software [22, 42, 53].

CHAPTER 3

Metamorphic Techniques

The metamorphic code generator described in this paper makes use of morphing techniques such as dead code insertion and function permutation. Metamorphic techniques range from simple ones, such as register swap, to advanced ones, such as formal grammar mutation. Some of these techniques are explained in detail in the following subsections.

3.1 Register Swap

Register swap is one of the simplest metamorphic techniques. In December 1998, the W95/Regswap virus implemented metamorphism via register usage exchange. Register swap means changing the register operands. For example, the instruction PUSH ECX can be replaced with PUSH EAX. With this technique, the opcode sequence remains the same. Figure 2 shows sample code fragments selected from two different generations of W95/Regswap that use different registers. A wildcard string can be used to detect this type of virus [10].

3.2 Subroutine Permutation

This technique changes the appearance of a virus by reordering the layout of the virus subroutines. If a virus has $n$ subroutines, then there can be $n!$ generations without repetition. Some examples are BadBoy and W32/Ghost (discovered in May 2000). BadBoy had 8 subroutines, so $8! = 40{,}320$ generations (recall that $n! = n \cdot (n-1)!$), and W32/Ghost had 10 functions, so $10! = 3{,}628{,}800$ combinations. However, both of them can be detected with search strings, as the content of each subroutine remains the same [10]. As with register swapping, subroutine permutation is a relatively weak malware morphing strategy, particularly with respect to statistical-based detection. Figure 3 illustrates this technique.

Figure 2: Two different generations of RegSwap [10]

Figure 3: Subroutine permutation [10]
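The generation counts quoted in Section 3.2 can be checked directly, and the permutation itself is easy to sketch. The Python fragment below is an illustrative sketch only: `generation_count` and `permute_subroutines` are hypothetical helper names, and subroutines are modeled as opaque strings rather than real code.

```python
import math
import random
from typing import List


def generation_count(n: int) -> int:
    """Number of distinct orderings of n subroutines, i.e. n!."""
    return math.factorial(n)


def permute_subroutines(subroutines: List[str], seed: int) -> List[str]:
    """Return a reordered copy of the subroutine list; the bodies are untouched."""
    rng = random.Random(seed)
    shuffled = list(subroutines)
    rng.shuffle(shuffled)
    return shuffled


print(generation_count(8))    # count for 8 subroutines (BadBoy)
print(generation_count(10))   # count for 10 subroutines (W32/Ghost)

# Placeholder subroutine bodies, for illustration only.
layout = ["sub_a", "sub_b", "sub_c", "sub_d"]
variant = permute_subroutines(layout, seed=1)
print(variant)
```

Because every subroutine body survives unchanged in each variant, a search string taken from any single body still matches, which is why the text calls this a weak morphing strategy.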

3.3 Dead Code Insertion

Dead code can be inserted into an existing program. In its simplest form, dead code is never executed. Alternatively, dead code can be executed, provided that it has no effect on the overall program function. Although more difficult to implement, this latter approach is more effective, since the dead code is more difficult to detect. Dead code can be very effective for evading malware detection, particularly with respect to statistical-based techniques, since the dead code can be used to mask the statistical properties of the underlying code. By adding such code, a virus can generate an unlimited number of unique copies. However, dead code insertion can be challenging at the assembly code level, since care must be taken so that addresses remain valid. The Win95/Zperm virus, which appeared in June and September of 2000, incorporated this technique [10]. Figure 4 shows the code structure changes of Zperm-like viruses.

Figure 4: Zperm virus [10]

3.4 Instruction Substitution

In this technique, an instruction is substituted by another equivalent instruction or group of instructions with the same functionality. For example, if MOV R1, R2 copies R1 into R2, it can be replaced by PUSH R1 followed by POP R2. As another trivial example, XOR R1, R1 and SUB R1, R1 both zero the contents of register R1, but the opcodes of these two instructions are different. Instruction substitution is a powerful technique for evading signature detection and altering code statistics. However, instruction substitution is relatively difficult to implement at the assembly code level. The W32/MetaPhor virus is one of the metamorphic virus generators that incorporate the instruction substitution technique [10].

3.5 Transposition

Transposition is the reordering of a sequence of instructions without changing the actual functionality. Two instructions can be reordered only if they are independent of each other. For example, consider the instructions:

1. OPCODE [R1], [R2]
2. OPCODE [R3], [R4]

These instructions can be swapped, since they are independent of each other. Therefore, they can be reordered as follows without changing the resulting functionality:

1. OPCODE [R3], [R4]
2. OPCODE [R1], [R2]

A similar technique can be applied to groups of instructions. Since the order of execution itself is different, transposition is used to evade signature-based detection.

3.6 Formal Grammar Mutation

Formal grammar mutation is a formalization of many existing morphing techniques [7, 14, 54]. In general, traditional morphing engines are non-deterministic automata (NDA), since transitions are possible from every symbol to every other symbol [54]. The symbol set is the set of all possible instructions. This means that any instruction can be followed by any other instruction. By formalizing mutation techniques, one can create formal grammar rules and apply these rules to create viral copies with great variation. Figure 5 shows a simple polymorphic decryptor template and two possible mutations of the decryptor code achieved using the formal grammar. Figure 6 shows the formal grammar. With this combination of decryptor template and formal grammar, it is possible to generate 960 different decryptors [54].

Figure 5: A simple polymorphic decryptor and two variants [54]

Figure 6: Formal grammar for decryptor mutation [54]
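As a small illustration of the independence condition in Section 3.5, the sketch below models each instruction by the registers it reads and writes and allows a swap only when no register dependency exists between the two. This is a simplified sketch under stated assumptions: the `Instr` type, the register sets and the helper names are hypothetical, and effects on memory and status flags are ignored.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Instr:
    text: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def independent(a: Instr, b: Instr) -> bool:
    """True if a and b share no read/write or write/write register dependency."""
    return (a.writes.isdisjoint(b.reads)
            and a.reads.isdisjoint(b.writes)
            and a.writes.isdisjoint(b.writes))


def transpose(code: List[Instr], i: int) -> List[Instr]:
    """Return a copy of code with code[i] and code[i + 1] swapped if that is safe."""
    result = list(code)
    if independent(result[i], result[i + 1]):
        result[i], result[i + 1] = result[i + 1], result[i]
    return result


# Toy instructions; the register usage is assumed for illustration.
i1 = Instr("ADD R1, R2", frozenset({"R1", "R2"}), frozenset({"R1"}))
i2 = Instr("ADD R3, R4", frozenset({"R3", "R4"}), frozenset({"R3"}))
i3 = Instr("SUB R2, R1", frozenset({"R1", "R2"}), frozenset({"R2"}))

swapped = transpose([i1, i2, i3], 0)    # i1 and i2 are independent
unchanged = transpose([i1, i3, i2], 0)  # i3 reads R1, which i1 writes
print([ins.text for ins in swapped])
print([ins.text for ins in unchanged])
```

The same check extends to groups of instructions by taking the union of the read and write sets of each group.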

CHAPTER 4

LLVM

4.1 Introduction

The complexity of modern applications is increasing. They are growing in size and support dynamic upgrades, backups and recovery mechanisms. It is also common to write components in multiple languages to gain efficiency. During the lifetime of an application, certain components have small hot spots in terms of memory footprint or CPU usage; others spread their execution time evenly throughout the application. It is important to perform program analysis and transformation throughout the lifetime of the application in order to understand its behavior. Lifelong code optimizations include inter-procedural optimizations performed at link-time, machine-dependent optimization at install time, dynamic optimization at run-time and profile-guided optimization between runs [27, 32].

LLVM (Low Level Virtual Machine) is a compiler infrastructure that has several novel features. It supports a language-independent instruction set. Its IR is in Static Single Assignment (SSA) form, which means that each variable is assigned once and cannot be reassigned afterwards [2, 32]. This is done by using numbers to represent variables. LLVM is a project separate from GCC, although the llvm-gcc frontend used in this project is based on GCC. LLVM supports static compilation as well as late compilation from the Intermediate Representation (IR) code, similar to Java's just-in-time (JIT) compiler.

4.2 Architecture of LLVM

The LLVM infrastructure originated in a project called The Lifelong Code Optimization Project (LCO-Project), which started at the Department of Computer Science at the University of Illinois at Urbana-Champaign [37].

4.2.1 Traditional Three Phase Compiler Design

Most traditional static compilers (for example, GCC, used for C/C++ programs) are three-phase compilers. The three main phases are the frontend, the optimizer and the backend. Figure 7 shows the typical design of three-phase compilers [27].

Figure 7: Three major components of a three-phase compiler

The key function of the frontend component is to parse the source code, check for any syntactical errors and then build a language-specific Abstract Syntax Tree (AST). The optimizer uses this tree and transforms it into a new representation by applying optimization techniques. Working from the AST, it rearranges instructions in order to optimize the code, and it removes duplicate and redundant computations. The backend produces the machine-dependent representation of the code (assembly code). It maps the code to the target machine's instruction set. Its main goal is to generate correct code that can take advantage of the underlying hardware. Common parts of a compiler backend include instruction selection, register allocation, and instruction scheduling [27].

4.2.2 LLVM Design

The key feature of LLVM's three-phase compiler design is that it supports multiple frontends and multiple backends by using a common intermediate representation. A frontend can be written for any source language, and its output is converted to an intermediate representation. This intermediate representation is machine and language independent. A backend can be written for any target platform to compile from this common representation [1, 27]. Figure 8 shows the LLVM design.

Figure 8: LLVM design [27]

With this design it is easy to add a new language by implementing a new frontend and reusing the existing optimizers and backends. New platforms can be supported by implementing new backends. If these components were not separated, then implementing a new language would mean starting over from scratch: to support $N$ languages and $M$ targets, we would need $N \times M$ different compilers [27].

Another potential advantage of the LLVM design follows from the separation of these components: the skills required to implement a frontend are completely different from the skills required to implement a backend or an optimizer [2, 32]. A frontend developer can maintain or enhance just that part of the compiler. This is a social rather than a technical benefit, but for an open source project it lowers the barrier to contributing.

Finally, the LLVM design serves a broader set of programmers than traditional compilers that support only one source language and target. Traditional open source compilers (like GCC) are stable and efficient because they serve larger communities.

This tends to generate better-optimized machine code than narrower compilers such as Free Pascal [27].

4.3 LLVM IR Bytecode

The most important aspect of LLVM's design is the common intermediate representation (IR) on which the optimizer operates. This common form separates the frontend and backend components from each other. It supports lightweight runtime optimizations, cross-function or inter-procedural optimizations, whole-program analysis and aggressive restructuring transformations. Figure 9 shows the structure of LLVM IR. It consists of the following sections [37]:

1. A module: The module is the container that holds functions and global variables.
2. Functions: Functions are named, callable units of instructions.
3. Global variables: Global variables can be accessed by any function.

Figure 9: LLVM bytecode file format [37]

Figure 10 shows a sample C program and its corresponding IR representation. In LLVM IR, logic is represented in the form of functions. Functions consist of a list of basic blocks. Each basic block consists of a sequence of instructions. Instructions in a basic block are always executed sequentially.

Figure 10: Sample C code and its corresponding IR bytecode

As IR bytecode is the only interface to the optimizer, writing this bytecode is a critical component. It should be easy for a frontend to generate, and it should be sufficiently expressive to allow optimizations and conversion into real target machine code [27].

Frontend programmers should understand the IR and its invariants. LLVM IR has a textual form. Therefore, frontend programmers should output LLVM IR in text. Similarly, backend writers (code generators) should know how to convert it into machine code.
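The containment hierarchy described in Section 4.3 (a module holds global variables and functions, a function holds basic blocks, and a basic block holds instructions) can be mirrored in a few lines of Python. This is only a toy model for illustration: the class names and the `opcode_sequence` helper are hypothetical and are not part of the LLVM API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instruction:
    opcode: str
    operands: List[str] = field(default_factory=list)


@dataclass
class BasicBlock:
    label: str
    instructions: List[Instruction] = field(default_factory=list)


@dataclass
class Function:
    name: str
    blocks: List[BasicBlock] = field(default_factory=list)


@dataclass
class Module:
    global_variables: List[str] = field(default_factory=list)
    functions: List[Function] = field(default_factory=list)


def opcode_sequence(module: Module) -> List[str]:
    """Flatten a module into its opcodes, in function, block and instruction order."""
    return [inst.opcode
            for fn in module.functions
            for block in fn.blocks
            for inst in block.instructions]


# A made-up module with one global variable and one single-block function.
entry = BasicBlock("entry", [Instruction("add", ["%a", "%b"]),
                             Instruction("ret", ["%sum"])])
demo = Module(global_variables=["@counter"],
              functions=[Function("add_two", [entry])])
print(opcode_sequence(demo))
```

Flattening a module into an opcode sequence in this way is the kind of view that opcode-based analysis operates on.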

4.4 Tools in LLVM

The LLVM infrastructure provides various tools to compile a program, run optimizations, and generate bytecode or machine code. Some of these tools are explained in the following sections. Figure 11 shows a program's life cycle in the LLVM compiler, from source program to executable.

Figure 11: Program's life cycle in LLVM compiler

4.4.1 llvm-gcc (.c -> .s)

This tool is used to generate the LLVM IR bytecode. It checks for syntactical errors and then produces the IR bytecode of the source program. The output uses the .s or .ll extension [28]. For example,

    llvm-gcc -emit-llvm -S helloworld.c

This command generates the file helloworld.s in the same directory.

4.4.2 llvm-as (.s -> .bc)

This tool is used to generate a bitcode file from the IR bytecode. For example,

    llvm-as -f helloworld.s

It creates the file helloworld.s.bc, which contains the binary bitcode form of the program [29].

4.4.3 lli

The lli command is used to execute an LLVM bitcode file [28]. For example,

    lli helloworld.s.bc

This command executes the helloworld program.

4.4.4 Opt

This tool is used to run different optimizers on IR bytecode. One can write a custom optimizer on top of the built-in optimizers. In this project we have written an optimizer pass to call dead code functions. The format of the command is:

    opt <optimizer name> <input bytecode or bitcode file> <output bytecode or bitcode file>

For example,

    opt -mem2reg helloworld.s.bc mhellow.bc

The file mhellow.bc is generated after running the optimizer pass mem2reg on helloworld.s.bc. The file mhellow.bc is the optimized bitcode of the program [28].

4.4.5 llvm-dis

This tool is used to disassemble an LLVM bitcode file into an IR bytecode file. For example,

    llvm-dis -f mhellow.bc -o opthelloworld.s

Using this command we can check how the code was optimized by the mem2reg pass [29].

4.4.6 llc

This tool is used to generate native assembly code from a bitcode file. For example,

    llc -f opthelloworld.bc

A file containing the native assembly code, opthelloworld.s, is created [28]. Throughout the project we used the following command to disassemble the binary:

    llc -o file.asm -asm-verbose=0 -march=x86 -x86-asm-syntax=intel -regalloc=default file.bc

Here, file.bc is the input binary file and file.asm is the output file containing assembly code.

4.4.7 llvm-link

This tool takes several LLVM bitcode files, links them together and generates a single LLVM bitcode file [33]. For example,

    llvm-link -o finalopt.bc addcode.s.bc basic.s.bc

This command generates a single bitcode file, finalopt.bc, after linking the bitcode files addcode.s.bc and basic.s.bc.

CHAPTER 5

Hidden Markov Models and Virus Detection

Metamorphism is an effective way to evade detection. Recent research shows that Hidden Markov Models (HMMs) are very effective in detecting metamorphic viruses [5, 11, 22, 39, 48, 49, 53]. Hidden Markov models can be viewed as a machine learning technique. Previous work [53] describes a method for training an HMM on sequences of opcodes from viruses that belong to the same family. The trained HMM is then used to score binaries, in order to determine whether a given binary is a virus or a benign file. A Log Likelihood Per Opcode (LLPO) score is calculated, and based on these scores a threshold is obtained that separates viruses from benign executables. An executable is then classified as a virus or a benign file according to which side of the threshold its LLPO score falls on. The detailed working of HMMs and of the virus detection mechanism is explained in the following sections.

5.1 HMM

The Hidden Markov Model (HMM) is a statistical pattern analysis tool. The notation used for HMMs is shown in Table 1 [43].

As the name suggests, a hidden Markov model includes a hidden Markov chain. Although this Markov chain is not directly observable, it is probabilistically related to a sequence of observed symbols. A generic Hidden Markov Model is shown in Figure 12. Its state and observation at time $t$ are represented by $X_t$ and $O_t$ respectively. The initial state $X_0$ and the matrix $A$ together determine the hidden Markov process. Each observation $O_i$ is related to the states of the Markov process by the matrix $B$.

Table 1: HMM Notations [43]

Symbol   Description
T        Length of the observed sequence
N        Number of states in the model
M        Number of distinct observation symbols
O        Observation sequence $(O_0, O_1, \ldots, O_{T-1})$
A        State transition probability matrix
B        Observation probability distribution matrix
π        Initial state distribution matrix

Figure 12: Generic Hidden Markov Model [43]

Research in [12] indicates that HMMs are used in protein modeling and speech recognition systems. HMMs can also be used to detect certain types of software piracy [20].

In general, an HMM needs to be trained on input data, and a trained model is created from this training data. Each individual element in the training data is mapped to an observation symbol. All unique observation symbols are then extracted and represented in the trained model. The trained model is then used to determine whether a new sequence of observations is similar to the sequences represented in the model. For virus detection, training data is collected from known viruses and one model is built for each virus family. A new file is tested against these models. If the new file matches a model, then it can be identified as a virus from that family [43, 53].
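The mapping from training data to observation symbols described above can be sketched as follows, assuming that the training data is a list of opcode mnemonics. The function names and the choice to map every unseen opcode to one extra "other" symbol are assumptions made for illustration, not a description of the detector used in this project.

```python
from typing import Dict, List, Sequence


def build_symbol_table(training_opcodes: Sequence[str]) -> Dict[str, int]:
    """Assign each distinct opcode an integer symbol in order of first appearance."""
    table: Dict[str, int] = {}
    for opcode in training_opcodes:
        if opcode not in table:
            table[opcode] = len(table)
    return table


def encode(opcodes: Sequence[str], table: Dict[str, int]) -> List[int]:
    """Map opcodes to symbols; opcodes unseen in training share one extra symbol."""
    other = len(table)  # assumption: a single catch-all symbol for unseen opcodes
    return [table.get(opcode, other) for opcode in opcodes]


# Made-up opcode data for illustration.
training = ["mov", "push", "mov", "call", "push"]
table = build_symbol_table(training)
print(table)                                  # M = len(table) distinct symbols
print(encode(["push", "xor", "mov"], table))  # "xor" was not seen in training
```

Here the number of distinct observation symbols $M$ from Table 1 is the size of the symbol table, and the length $T$ is the length of the encoded sequence.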

5.1.1 Example

The papers [22, 43, 53] explain the inner working of an HMM with the following simple example. Suppose we want to determine the average annual temperature for past years by observing tree sizes (S for small, M for medium and L for large). To keep it simple, assume the annual temperature is either hot (H) or cold (C). In addition, the probabilities of the annual temperature trend are known. The probability of a hot year being followed by another hot year (HH) is 0.7, of a hot year being followed by a cold year (HC) is 0.3, of a cold year being followed by a hot year (CH) is 0.4, and of a cold year being followed by another cold year (CC) is 0.6 [43]. With rows and columns ordered (H, C), the matrix representation of these probabilities is:

$$\begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}$$

The correlation between tree size and temperature is also known. In a hot year, the probability of the tree size being small is 0.1, medium is 0.4 and large is 0.5. In a cold year, the probability of the tree size being small is 0.7, medium is 0.2 and large is 0.1 [43]. With rows ordered (H, C) and columns ordered (S, M, L), the matrix representation of this information is:

$$\begin{pmatrix} 0.1 & 0.4 & 0.5 \\ 0.7 & 0.2 & 0.1 \end{pmatrix}$$

This known information can be mapped to the HMM notation. The states are the annual temperatures, and the tree sizes are the observable symbols. The states H and C are hidden, as we cannot directly observe temperatures in the past. We do have access to the observation symbols (S, M and L). With this information, we can build the HMM shown in Figure 13.

Figure 13: HMM Model [43]

Suppose we have the tree sizes for four consecutive years, (S, M, S, L), and we want to find the most likely annual temperature sequence. To solve this problem using an HMM, the parameters are as follows [43]:

1. State transition probability matrix:
$$A = \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix}$$
2. Observation probability distribution matrix:
$$B = \begin{bmatrix} 0.1 & 0.4 & 0.5 \\ 0.7 & 0.2 & 0.1 \end{bmatrix}$$
3. Number of states (H and C) in the model: N = 2
4. Number of distinct observation symbols (S, M, and L): M = 3
5. Initial state distribution matrix:
$$\pi = \begin{bmatrix} 0.6 & 0.4 \end{bmatrix}$$
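As a minimal sketch (not code from the project), the parameters above can be written down in Python and used to score candidate temperature sequences: the probability of one state path is the product of the initial, transition, and observation probabilities along it. The names `STATES`, `SYMBOLS`, `path_probability`, and `best_path` are illustrative assumptions, and exhaustive enumeration is practical here only because the example is tiny.

```python
from itertools import product

# Model from the tree-size example; index 0 = H (hot), 1 = C (cold).
STATES = "HC"
SYMBOLS = {"S": 0, "M": 1, "L": 2}
A = [[0.7, 0.3], [0.4, 0.6]]            # state transition probabilities
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]  # observation probabilities
PI = [0.6, 0.4]                         # initial state distribution


def path_probability(path, observations):
    """Joint probability of one state path and the observation sequence."""
    states = [STATES.index(s) for s in path]
    obs = [SYMBOLS[o] for o in observations]
    prob = PI[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        prob *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return prob


def best_path(observations):
    """Enumerate all N^T state paths and return the most probable one."""
    paths = ("".join(p) for p in product(STATES, repeat=len(observations)))
    return max(paths, key=lambda p: path_probability(p, observations))


if __name__ == "__main__":
    obs = "SMSL"
    print(best_path(obs), path_probability("HHCC", obs))
```

The work grows as N^T, so this approach does not scale to opcode sequences of realistic length.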

The steps to determine the state sequence of length T = 4 for the given observations (S, M, S, L) are as follows [43]:

1. Enumerate all possible state sequences; there are $N^T = 2^4 = 16$ of them.
2. Calculate the probability of each state sequence (Table 2) for the given observation sequence. For example, the probability of the sequence HHCC is

$$P(HHCC) = \pi_H\, b_H(S)\, a_{H,H}\, b_H(M)\, a_{H,C}\, b_C(S)\, a_{C,C}\, b_C(L) = (0.6)(0.1)(0.7)(0.4)(0.3)(0.7)(0.6)(0.1) = 0.000212$$

3. Select the state sequence with the highest probability as the annual temperature sequence. In Table 2 [43], CCCH has the highest probability, so the answer is CCCH.

The brute-force method applied here requires an exponential amount of work, which is infeasible for sequences of realistic length. HMMs come with efficient algorithms that solve the following three problems.

5.2 The Three Problems

The following three problems can be solved efficiently using HMMs [22, 42, 43]:

Problem 1. Given the model $\lambda = (A, B, \pi)$ and an observation sequence O, find $P(O \mid \lambda)$.

Problem 2. Given the model $\lambda = (A, B, \pi)$ and an observation sequence O, find an optimal state sequence for the underlying Markov process.

Table 2: Probabilities of observing (S, M, S, L) for all possible state sequences

State sequence   Probability
HHHH             0.000412
HHHC             0.000035
HHCH             0.000706
HHCC             0.000212
HCHH             0.000050
HCHC             0.000004
HCCH             0.000302
HCCC             0.000091
CHHH             0.001098
CHHC             0.000094
CHCH             0.001882
CHCC             0.000564
CCHH             0.000470
CCHC             0.000040
CCCH             0.002822
CCCC             0.000847

Problem 3. Given an observation sequence O, the number of distinct symbols M, and the number of states N, find the model $\lambda = (A, B, \pi)$ that maximizes the probability of O.

The first problem concentrates on determining the likelihood of an observation sequence under the model λ. The second problem concentrates on uncovering the hidden part of the HMM. The third problem concentrates on training the HMM from an input observation sequence O and the parameters M and N.

In this paper, we first train a model (Problem 3) on opcode sequences derived from a base piece of software. We then use the trained model to score (Problem 1) morphed versions of this base software. Previous research has shown that HMMs are effective at detecting metamorphic malware, and that HMMs can also be used to detect certain types of software piracy [20]. That is, HMMs have proven useful for detecting morphed or disguised versions of code. Consequently, HMM analysis provides a challenging test for any code morphing technique.

5.2.1 Forward Algorithm

The forward algorithm, or α pass, determines $P(O \mid \lambda)$ [42, 43]. For $t = 0, 1, \ldots, T-1$ and $i = 0, 1, \ldots, N-1$, define

$$\alpha_t(i) = P(O_0, O_1, \ldots, O_t,\ x_t = q_i \mid \lambda)$$

That is, $\alpha_t(i)$ is the probability of the partial observation sequence up to time t, with the Markov process in state $q_i$ at time t. Using the forward algorithm, $P(O \mid \lambda)$ can be computed as follows:

1. Let $\alpha_0(i) = \pi_i\, b_i(O_0)$ for $i = 0, 1, \ldots, N-1$.
2. For $t = 1, 2, \ldots, T-1$ and $i = 0, 1, \ldots, N-1$, compute [43]
$$\alpha_t(i) = \left( \sum_{j=0}^{N-1} \alpha_{t-1}(j)\, a_{ji} \right) b_i(O_t)$$
3. Then $P(O \mid \lambda) = \sum_{i=0}^{N-1} \alpha_{T-1}(i)$.

5.2.2 Backward Algorithm

The backward algorithm, or β pass, helps to find the most likely state sequence. For $t = 0, 1, \ldots, T-1$ and $i = 0, 1, \ldots, N-1$, define [22, 42, 43]

$$\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_{T-1} \mid x_t = q_i, \lambda)$$

The $\beta_t(i)$ can be computed efficiently in the following steps:

1. Let $\beta_{T-1}(i) = 1$ for $i = 0, 1, \ldots, N-1$.

2. For $t = T-2, T-3, \ldots, 0$ and $i = 0, 1, \ldots, N-1$, compute [43]
$$\beta_t(i) = \sum_{j=0}^{N-1} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$$

For $t = 0, 1, \ldots, T-2$ and $i = 0, 1, \ldots, N-1$, define

$$\gamma_t(i) = P(x_t = q_i \mid O, \lambda)$$

Since $\alpha_t(i)$ measures the relevant probability up to time t and $\beta_t(i)$ measures the relevant probability after time t [43],

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)}$$

The most likely state at any time t is the state $q_i$ for which $\gamma_t(i)$ is maximum.

5.2.3 Baum-Welch Algorithm

This algorithm provides an efficient way to adjust the model parameters to best fit the observations. The number of states N and the number of unique observation symbols M are held constant, while A, B, and π are re-estimated subject to the row stochastic condition. The re-estimation process is as follows [42, 43]:

1. Initialize $\lambda = (A, B, \pi)$ with a best guess or with random values, for example $\pi_i \approx 1/N$, $a_{ij} \approx 1/N$, and $b_j(k) \approx 1/M$.
2. Compute $\alpha_t(i)$, $\beta_t(i)$, $\gamma_t(i)$, and $\gamma_t(i, j)$, where $\gamma_t(i, j)$ is the di-gamma, defined as [42]
$$\gamma_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$$
The quantities $\gamma_t(i)$ and $\gamma_t(i, j)$ are related by
$$\gamma_t(i) = \sum_{j=0}^{N-1} \gamma_t(i, j)$$

3. Re-estimate the model parameters as follows. For $i = 0, 1, \ldots, N-1$, let
$$\pi_i = \gamma_0(i)$$
For $i = 0, 1, \ldots, N-1$ and $j = 0, 1, \ldots, N-1$, compute [42]
$$a_{ij} = \frac{\sum_{t=0}^{T-2} \gamma_t(i, j)}{\sum_{t=0}^{T-2} \gamma_t(i)}$$
For $j = 0, 1, \ldots, N-1$ and $k = 0, 1, \ldots, M-1$, compute [42]
$$b_j(k) = \frac{\sum_{t \in \{0, 1, \ldots, T-2\},\ O_t = k} \gamma_t(j)}{\sum_{t=0}^{T-2} \gamma_t(j)}$$
4. If $P(O \mid \lambda)$ increases, go to step 2.

5.3 HMM and Virus Detection

The effectiveness of HMMs for metamorphic virus detection is explained in detail in [12, 23]. To serve as a virus detection tool, an HMM must be trained on input data to generate a model. The trained HMM represents the statistical properties of a virus family. The trained model is then used to score a new binary file, and the score indicates how close the file is to the virus family that the model represents. Based on threshold values, we can then categorize files.

To train the HMM, a set of virus files belonging to the same family is first disassembled. From each disassembled file, the assembly opcodes are extracted; these opcodes constitute the HMM observation symbols. An example of extracted opcodes is shown in Figure 14. A long observation sequence is generated by concatenating the opcode sequences from all virus files within the same family. This concatenated sequence is then used to train an HMM. The set of unique opcodes in this long sequence serves as the set of distinct observation symbols. An example HMM is shown in Figure 15.
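The preparation of the training sequence described above can be sketched in Python as follows. This is an illustration only: it assumes each disassembled file has already been reduced to a list of opcode mnemonics, and the function name `build_observation_sequence` and the sample opcodes are hypothetical.

```python
def build_observation_sequence(opcode_files):
    """Concatenate per-file opcode lists and map each opcode to a symbol index.

    Returns (observations, symbol_table), where observations is a list of
    integers in [0, M) and symbol_table maps each distinct opcode to its index.
    """
    symbol_table = {}
    observations = []
    for opcodes in opcode_files:          # one list of mnemonics per file
        for opcode in opcodes:
            # Assign the next free index the first time an opcode is seen.
            index = symbol_table.setdefault(opcode, len(symbol_table))
            observations.append(index)
    return observations, symbol_table


if __name__ == "__main__":
    family = [["push", "mov", "call"], ["mov", "add", "mov", "ret"]]
    obs, table = build_observation_sequence(family)
    print("T =", len(obs), "M =", len(table))
    print(obs)
```

The length of the returned list plays the role of T and the size of the symbol table plays the role of M in the HMM notation.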

Figure 14: Extracted opcode sequence

5.3.1 Log Likelihood Per Opcode

Scoring an observation sequence with an HMM requires computing a product of probabilities. As T increases, this product tends to 0 exponentially. To avoid numerical underflow, the values computed in the forward and backward algorithms are normalized (scaled) at each step. This process is explained in [43]. With scaling factors $c_j$, the log likelihood is

$$\log[P(O \mid \lambda)] = -\sum_{j=0}^{T-1} \log c_j$$

Because this is a sum over all T positions in the sequence, the log likelihood is length dependent: the longer the sequence, the larger the magnitude of its log likelihood. Sequences in the test set can differ in length from the sequences used to train the HMM. To obtain a length-independent score, the log likelihood per opcode (LLPO), divide the log likelihood by the number of opcodes in the sequence [53].

Figure 15: HMM Model

5.3.2 Effectiveness of HMM Detection

The effectiveness of HMMs in detecting metamorphic viruses has been demonstrated in [53]. Virus files produced by the metamorphic generator NGVCK differ significantly from one another, yet they were still detected by the HMM in [53]. The detection rate of the HMM is almost 90% [12].

5.4 Evading HMM Detection

Researchers have tried to write a metamorphic engine that evades HMM detection [22]. In that work, dead code is inserted into the virus files based on a dynamic scoring algorithm: a block of dead code is inserted only if doing so moves the score of the virus file closer to the score of benign files. Results in [23] indicate that inserting long sequences of opcodes, such as subroutines, is more effective at avoiding HMM detection than randomly inserting blocks of dead code. The HMM detector failed when 35% dead blocks and 30% subroutines from benign files were inserted into the virus file. The metamorphic code generator presented in this paper makes use of these results when inserting dead code.
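The forward algorithm and the log likelihood per opcode (LLPO) score described earlier in this chapter can be sketched together as a scaled forward pass. This is a hedged illustration rather than the detector used in [53]: the function names `log_likelihood` and `llpo` are assumptions, observations are assumed to be integer symbol indices, and the sequence is assumed to have nonzero probability under the model.

```python
import math


def log_likelihood(A, B, pi, observations):
    """Return log P(O | lambda) using the scaled forward (alpha) pass."""
    n = len(pi)
    # alpha_0(i) = pi_i * b_i(O_0)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(observations)):
        if t > 0:
            # alpha_t(i) = (sum_j alpha_{t-1}(j) * a_ji) * b_i(O_t)
            alpha = [
                sum(alpha[j] * A[j][i] for j in range(n)) * B[i][observations[t]]
                for i in range(n)
            ]
        # Scale so the alphas sum to 1; c_t is the scaling factor.
        c_t = 1.0 / sum(alpha)
        alpha = [a * c_t for a in alpha]
        log_prob -= math.log(c_t)   # log P(O | lambda) = -sum_t log c_t
    return log_prob


def llpo(A, B, pi, observations):
    """Log likelihood per opcode: log likelihood divided by sequence length."""
    return log_likelihood(A, B, pi, observations) / len(observations)


if __name__ == "__main__":
    # Tree-size example from Section 5.1.1: states (H, C), symbols S=0, M=1, L=2.
    A = [[0.7, 0.3], [0.4, 0.6]]
    B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
    pi = [0.6, 0.4]
    print(llpo(A, B, pi, [0, 1, 0, 2]))
```

Dividing by the sequence length is what allows files of different sizes to be compared against a single threshold.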

CHAPTER 6

Design and Implementation

6.1 Introduction

Some of the metamorphic techniques explained in Chapter 3 are implemented here at the IR bytecode level instead of at the assembly code level. In previous work, conditional code obfuscation techniques have been implemented at the IR bytecode level by writing LLVM optimizer passes [17]. Malware writers have also attempted to implement a shadow attack [35].

The aim of this project is to produce multiple copies of the base software that are hard to detect and significantly different from each other. When a program is compiled with our optimizer, a significantly different morphed copy of the base software is generated.

Even when a variety of metamorphic techniques are applied, the HMM detector developed in [53] is able to classify virus files and benign files correctly. An attempt to evade an HMM-based detector [11] was unsuccessful. This indicates that HMMs are very effective for detection. Therefore, our aim is to write a metamorphic code generator that evades an HMM-based virus detector.

6.2 Challenge and Innovation

Morphing code at the IR level offers the following advantages [2]. LLVM was originally implemented for C and C++, but its language-agnostic design has spawned a wide variety of front ends, which include Objective-C, Fortran, Ada, Haskell, Java bytecode, Python, Ruby, ActionScript, GLSL,

D, and Rust. Code written in any of these languages can use our code generator to produce metamorphic copies. The intermediate form is platform independent. At the IR level, virtual addresses are not assigned; addresses are assigned at the bitcode level. So, by morphing at the IR level, we avoid difficulties associated with morphing at the assembly level.

6.3 Goals

Morphed copies of code should have the same functionality as the base file. In addition, the higher the percentage of inserted or modified code, the more the morphed files should differ (on average) from the base file. A morphed base file will look like a morphing file if its opcode counts and opcode sequences are more like those of the morphing files than those of the base file. As previously mentioned, HMMs have a proven record of being able to effectively see through metamorphic code. Consequently, if we can morph code sufficiently to confuse HMM-based analysis, this will provide a strong indication of the success of our morphing strategy.

6.4 Metamorphic Techniques Used

As we are morphing at the IR bytecode level, it is difficult to adopt some of the techniques described in Chapter 3. For example, register swapping and equivalent instruction substitution are relatively difficult to implement at the IR level. Therefore, to provide a proof of concept, we have restricted our code morphing to a combination of dead code insertion and subroutine permutation. We accomplish both of these morphing strategies by inserting randomly selected complete subroutines of dead code from other program files. In addition, the order of these dead subroutines is randomized. In this way, we create a significant amount of transposition and dead code variation between different morphed copies. In addition, we insert call statements to all dead code subroutines so that they are not trivially identifiable as dead code. These techniques are discussed in the following sections.

6.4.1 Dead Code Insertion

Dead code insertion involves inserting instructions whose results are never used in any other computation. The main goal of adding this code is to increase the diversity of opcodes. Since we are implementing it at the IR bytecode level, which has a RISC-like instruction set, it is difficult to add individual dead instructions. However, complete functions can easily be inserted using the linker tool llvm-link. To insert dead code, we need the IR bytecode of the morphing files. We have used files from the coreutils Linux commands [24] and from the httpd web server [4] as sources of dead code. These files include system-level code that performs operations we would expect to be somewhat similar to our selected base code. A detailed algorithm is explained in Section 6.5.1.

LLVM provides options to optimize away dead code. Anti-virus software can also identify code snippets that are never actually executed, by tracking the execution sequence, and use this to detect the virus. To make the metamorphic code generator more robust, we call the inserted dead code. A detailed algorithm is explained in Section 6.5.2.

6.4.2 Function Permutation

As explained in Chapter 3, function permutation reorders the layout of function definitions. Since IR bytecode is available in textual form, function permutations

can be implemented easily by changing the order of the functions. This helps to evade pattern-matching detectors. We have written a Python script that operates on IR bytecode and produces another text file that has the same functions in a different layout. A detailed algorithm is explained in Section 6.5.3.

6.5 Implementation

In this project we have developed three passes. These passes and their algorithms are explained in the following subsections. The high-level architecture of our morphing engine appears in Figure 16.

Figure 16: Metamorphic code generator architecture diagram

6.5.1 Dead Code Insertion

The first pass inserts the dead code. A base file, a morphing file (i.e., our source of dead code), and a dead code percentage are specified. Based on the percentage of dead code, we determine the total number of lines we want to insert

into the base file. We then select complete functions from the morphing file so that their total size approximates the number of lines we want to insert into the base file. These subroutines are integrated into the base file at the linking stage. As output, the pass reports the names of the functions it has inserted. It also identifies which of the inserted dead functions can be called in pass 2. The details of this first pass of our code morphing technique are given below.

1. Compile the selected morphing file using the llvm-gcc command and generate its IR bytecode.
2. From this IR bytecode, determine the function dependencies.
3. For each function, calculate its number of lines.
4. Based on the total number of dead code lines, use a greedy strategy to determine a subset of functions that best approximates the number of lines to be inserted.
5. Copy the selected functions to a temporary IR bytecode file.
6. Create bitcode files for the base code and the temporary IR bytecode using llvm-as.
7. Merge these two files using llvm-link.
8. If there are any subroutine naming conflicts, replace each offending name in the temporary IR bytecode file with a random string, then repeat the bitcode creation and merge steps.
9. Delete the temporary IR bytecode file.

6.5.2 Call Dead Functions

In this pass, we use the LLVM optimizer to insert a call instruction for each dead code subroutine. As mentioned in Section 6.5.1, pass 1 identifies dead functions

which can be called using this pass. The optimizer pass takes a function name as input. It then finds the main function definition in the IR bytecode and inserts a call instruction after every load instruction. This pass operates on the Module class. The current implementation does not support structure-type parameters or pointers other than single-level pointers. For each dead code subroutine, we perform the following steps.

1. Find the function object of the main function.
2. Iterate over the instructions in the function object. If an instruction is of type load, insert the call instruction after it.
3. To build the call instruction, iterate over the parameters of the dead function. For each parameter, allocate memory and initialize it with a random value.
4. Finally, insert the call instruction.

6.5.3 Function Permutation

The third pass performs function permutation by simply reordering the functions in the IR bytecode file. This pass is written as a Python script. Its algorithm is as follows.

1. Read the IR bytecode file.
2. Write all global variables to a temporary file.
3. Using a random number generator, generate a random number between 1 and the total number of functions.
4. If the corresponding function has not already been added, write its function definition to the temporary IR bytecode file.
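A minimal Python sketch of this kind of reordering is shown below. It is not the project's script. It assumes textual LLVM IR in which each function definition starts with a line beginning with `define` and ends with a line consisting only of `}`, it keeps all other module-level lines (globals, declarations, metadata) in their original order ahead of the functions, and it uses `random.shuffle` in place of the repeated random draws of steps 3 and 4. The name `permute_functions` and the sample IR are hypothetical.

```python
import random


def permute_functions(ir_text, seed=None):
    """Return textual IR with the function definitions in a random order."""
    other_lines = []   # globals, declarations, metadata, comments
    functions = []     # each entry is the list of lines of one definition
    current = None
    for line in ir_text.splitlines():
        if current is None and line.startswith("define"):
            current = [line]              # start of a function definition
        elif current is not None:
            current.append(line)
            if line.rstrip() == "}":      # closing brace ends the definition
                functions.append(current)
                current = None
        else:
            other_lines.append(line)

    if current is not None:
        raise ValueError("unterminated function definition")

    random.Random(seed).shuffle(functions)
    header = "\n".join(other_lines).strip()
    blocks = ([header] if header else []) + ["\n".join(f) for f in functions]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = (
        "@g = global i32 0\n\n"
        "define i32 @f() {\n  ret i32 1\n}\n\n"
        "define i32 @main() {\n  %r = call i32 @f()\n  ret i32 %r\n}\n"
    )
    print(permute_functions(sample, seed=7))
```

Passing a fixed `seed` makes a particular layout reproducible, while omitting it yields a different layout on each run.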


CIS 1.5 Course Objectives. a. Understand the concept of a program (i.e., a computer following a series of instructions) By the end of this course, students should CIS 1.5 Course Objectives a. Understand the concept of a program (i.e., a computer following a series of instructions) b. Understand the concept of a variable

More information

Compilers and Code Optimization EDOARDO FUSELLA

Compilers and Code Optimization EDOARDO FUSELLA Compilers and Code Optimization EDOARDO FUSELLA The course covers Compiler architecture Pre-requisite Front-end Strong programming background in C, C++ Back-end LLVM Code optimization A case study: nu+

More information

CERT-In. Indian Computer Emergency Response Team ANTI VIRUS POLICY & BEST PRACTICES

CERT-In. Indian Computer Emergency Response Team ANTI VIRUS POLICY & BEST PRACTICES CERT-In Indian Computer Emergency Response Team ANTI VIRUS POLICY & BEST PRACTICES Department of Information Technology Ministry of Communications and Information Technology Government of India Anti Virus

More information

Symbol Tables Symbol Table: In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is

More information

Polymorphic Blending Attacks. Slides by Jelena Mirkovic

Polymorphic Blending Attacks. Slides by Jelena Mirkovic Polymorphic Blending Attacks Slides by Jelena Mirkovic 1 Motivation! Polymorphism is used by malicious code to evade signature-based IDSs Anomaly-based IDSs detect polymorphic attacks because their byte

More information

1. true / false By a compiler we mean a program that translates to code that will run natively on some machine.

1. true / false By a compiler we mean a program that translates to code that will run natively on some machine. 1. true / false By a compiler we mean a program that translates to code that will run natively on some machine. 2. true / false ML can be compiled. 3. true / false FORTRAN can reasonably be considered

More information

PLT Course Project Demo Spring Chih- Fan Chen Theofilos Petsios Marios Pomonis Adrian Tang

PLT Course Project Demo Spring Chih- Fan Chen Theofilos Petsios Marios Pomonis Adrian Tang PLT Course Project Demo Spring 2013 Chih- Fan Chen Theofilos Petsios Marios Pomonis Adrian Tang 1. Introduction To Code Obfuscation 2. LLVM 3. Obfuscation Techniques 1. String Transformation 2. Junk Code

More information

CST-402(T): Language Processors

CST-402(T): Language Processors CST-402(T): Language Processors Course Outcomes: On successful completion of the course, students will be able to: 1. Exhibit role of various phases of compilation, with understanding of types of grammars

More information

1. Describe History of C++? 2. What is Dev. C++? 3. Why Use Dev. C++ instead of C++ DOS IDE?

1. Describe History of C++? 2. What is Dev. C++? 3. Why Use Dev. C++ instead of C++ DOS IDE? 1. Describe History of C++? The C++ programming language has a history going back to 1979, when Bjarne Stroustrup was doing work for his Ph.D. thesis. One of the languages Stroustrup had the opportunity

More information

2 rd class Department of Programming. OOP with Java Programming

2 rd class Department of Programming. OOP with Java Programming 1. Structured Programming and Object-Oriented Programming During the 1970s and into the 80s, the primary software engineering methodology was structured programming. The structured programming approach

More information

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis CS 1622 Lecture 2 Lexical Analysis CS 1622 Lecture 2 1 Lecture 2 Review of last lecture and finish up overview The first compiler phase: lexical analysis Reading: Chapter 2 in text (by 1/18) CS 1622 Lecture

More information

CS5363 Final Review. cs5363 1

CS5363 Final Review. cs5363 1 CS5363 Final Review cs5363 1 Programming language implementation Programming languages Tools for describing data and algorithms Instructing machines what to do Communicate between computers and programmers

More information

Compilers and Interpreters

Compilers and Interpreters Overview Roadmap Language Translators: Interpreters & Compilers Context of a compiler Phases of a compiler Compiler Construction tools Terminology How related to other CS Goals of a good compiler 1 Compilers

More information

9/5/17. The Design and Implementation of Programming Languages. Compilation. Interpretation. Compilation vs. Interpretation. Hybrid Implementation

9/5/17. The Design and Implementation of Programming Languages. Compilation. Interpretation. Compilation vs. Interpretation. Hybrid Implementation Language Implementation Methods The Design and Implementation of Programming Languages Compilation Interpretation Hybrid In Text: Chapter 1 2 Compilation Interpretation Translate high-level programs to

More information

Fighting Spam, Phishing and Malware With Recurrent Pattern Detection

Fighting Spam, Phishing and Malware With Recurrent Pattern Detection Fighting Spam, Phishing and Malware With Recurrent Pattern Detection White Paper September 2017 www.cyren.com 1 White Paper September 2017 Fighting Spam, Phishing and Malware With Recurrent Pattern Detection

More information

Practical Malware Analysis

Practical Malware Analysis Practical Malware Analysis Ch 4: A Crash Course in x86 Disassembly Revised 1-16-7 Basic Techniques Basic static analysis Looks at malware from the outside Basic dynamic analysis Only shows you how the

More information

Compiling Techniques

Compiling Techniques Lecture 2: The view from 35000 feet 19 September 2017 Table of contents 1 2 Passes Representations 3 Instruction Selection Register Allocation Instruction Scheduling 4 of a compiler Source Compiler Machine

More information

Life Cycle of Source Program - Compiler Design

Life Cycle of Source Program - Compiler Design Life Cycle of Source Program - Compiler Design Vishal Trivedi * Gandhinagar Institute of Technology, Gandhinagar, Gujarat, India E-mail: raja.vishaltrivedi@gmail.com Abstract: This Research paper gives

More information

Operating System Services

Operating System Services CSE325 Principles of Operating Systems Operating System Services David Duggan dduggan@sandia.gov January 22, 2013 Reading Assignment 3 Chapter 3, due 01/29 1/23/13 CSE325 - OS Services 2 What Categories

More information

SMD149 - Operating Systems - File systems

SMD149 - Operating Systems - File systems SMD149 - Operating Systems - File systems Roland Parviainen November 21, 2005 1 / 59 Outline Overview Files, directories Data integrity Transaction based file systems 2 / 59 Files Overview Named collection

More information

Compiling and Interpreting Programming. Overview of Compilers and Interpreters

Compiling and Interpreting Programming. Overview of Compilers and Interpreters Copyright R.A. van Engelen, FSU Department of Computer Science, 2000 Overview of Compilers and Interpreters Common compiler and interpreter configurations Virtual machines Integrated programming environments

More information

Multiple Choice Questions. Chapter 5

Multiple Choice Questions. Chapter 5 Multiple Choice Questions Chapter 5 Each question has four choices. Choose most appropriate choice of the answer. 1. Developing program in high level language (i) facilitates portability of nonprocessor

More information

6. Intermediate Representation

6. Intermediate Representation 6. Intermediate Representation Oscar Nierstrasz Thanks to Jens Palsberg and Tony Hosking for their kind permission to reuse and adapt the CS132 and CS502 lecture notes. http://www.cs.ucla.edu/~palsberg/

More information

Topic IV. Block-structured procedural languages Algol and Pascal. References:

Topic IV. Block-structured procedural languages Algol and Pascal. References: References: Topic IV Block-structured procedural languages Algol and Pascal Chapters 5 and 7, of Concepts in programming languages by J. C. Mitchell. CUP, 2003. Chapters 10( 2) and 11( 1) of Programming

More information

Automated static deobfuscation in the context of Reverse Engineering

Automated static deobfuscation in the context of Reverse Engineering Automated static deobfuscation in the context of Reverse Engineering Sebastian Porst (sebastian.porst@zynamics.com) Christian Ketterer (cketti@gmail.com) Sebastian zynamics GmbH Lead Developer BinNavi

More information

The View from 35,000 Feet

The View from 35,000 Feet The View from 35,000 Feet This lecture is taken directly from the Engineering a Compiler web site with only minor adaptations for EECS 6083 at University of Cincinnati Copyright 2003, Keith D. Cooper,

More information

Modern Buffer Overflow Prevention Techniques: How they work and why they don t

Modern Buffer Overflow Prevention Techniques: How they work and why they don t Modern Buffer Overflow Prevention Techniques: How they work and why they don t Russ Osborn CS182 JT 4/13/2006 1 In the past 10 years, computer viruses have been a growing problem. In 1995, there were approximately

More information

1DL321: Kompilatorteknik I (Compiler Design 1)

1DL321: Kompilatorteknik I (Compiler Design 1) Administrivia 1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation Lecturer: Kostis Sagonas (kostis@it.uu.se) Course home page: http://www.it.uu.se/edu/course/homepage/komp/ht16

More information

Pioneering Compiler Design

Pioneering Compiler Design Pioneering Compiler Design NikhitaUpreti;Divya Bali&Aabha Sharma CSE,Dronacharya College of Engineering, Gurgaon, Haryana, India nikhita.upreti@gmail.comdivyabali16@gmail.com aabha6@gmail.com Abstract

More information

Compilers. History of Compilers. A compiler allows programmers to ignore the machine-dependent details of programming.

Compilers. History of Compilers. A compiler allows programmers to ignore the machine-dependent details of programming. Compilers Compilers are fundamental to modern computing. They act as translators, transforming human-oriented programming languages into computer-oriented machine languages. To most users, a compiler can

More information

Extending Yioop! Abilities to Search the Invisible Web

Extending Yioop! Abilities to Search the Invisible Web San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2012 Extending Yioop! Abilities to Search the Invisible Web Tanmayee Potluri San Jose State University

More information

OptiCode: Machine Code Deobfuscation for Malware Analysis

OptiCode: Machine Code Deobfuscation for Malware Analysis OptiCode: Machine Code Deobfuscation for Malware Analysis NGUYEN Anh Quynh, COSEINC CONFidence, Krakow - Poland 2013, May 28th 1 / 47 Agenda 1 Obfuscation problem in malware analysis

More information

code pattern analysis of object-oriented programming languages

code pattern analysis of object-oriented programming languages code pattern analysis of object-oriented programming languages by Xubo Miao A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science Queen s

More information

Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation

Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Traditional Three-pass Compiler

More information

High-Level Language VMs

High-Level Language VMs High-Level Language VMs Outline Motivation What is the need for HLL VMs? How are these different from System or Process VMs? Approach to HLL VMs Evolutionary history Pascal P-code Object oriented HLL VMs

More information

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as 372 Chapter 16 Code Improvement 16.10 Exercises 16.1 In Section 16.2 we suggested replacing the instruction r1 := r2 / 2 with the instruction r1 := r2 >> 1, and noted that the replacement may not be correct

More information

SentinelOne Technical Brief

SentinelOne Technical Brief SentinelOne Technical Brief SentinelOne unifies prevention, detection and response in a fundamentally new approach to endpoint protection, driven by machine learning and intelligent automation. By rethinking

More information

Topic IV. Parameters. Chapter 5 of Programming languages: Concepts & constructs by R. Sethi (2ND EDITION). Addison-Wesley, 1996.

Topic IV. Parameters. Chapter 5 of Programming languages: Concepts & constructs by R. Sethi (2ND EDITION). Addison-Wesley, 1996. References: Topic IV Block-structured procedural languages Algol and Pascal Chapters 5 and 7, of Concepts in programming languages by J. C. Mitchell. CUP, 2003. Chapter 5 of Programming languages: Concepts

More information

Introduction to Java Programming

Introduction to Java Programming Introduction to Java Programming Lecture 1 CGS 3416 Spring 2017 1/9/2017 Main Components of a computer CPU - Central Processing Unit: The brain of the computer ISA - Instruction Set Architecture: the specific

More information

ECE521: Week 11, Lecture March 2017: HMM learning/inference. With thanks to Russ Salakhutdinov

ECE521: Week 11, Lecture March 2017: HMM learning/inference. With thanks to Russ Salakhutdinov ECE521: Week 11, Lecture 20 27 March 2017: HMM learning/inference With thanks to Russ Salakhutdinov Examples of other perspectives Murphy 17.4 End of Russell & Norvig 15.2 (Artificial Intelligence: A Modern

More information

Intermediate Code Generation

Intermediate Code Generation Intermediate Code Generation In the analysis-synthesis model of a compiler, the front end analyzes a source program and creates an intermediate representation, from which the back end generates target

More information

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Author: Dillon Tellier Advisor: Dr. Christopher Lupo Date: June 2014 1 INTRODUCTION Simulations have long been a part of the engineering

More information

What is a compiler? Xiaokang Qiu Purdue University. August 21, 2017 ECE 573

What is a compiler? Xiaokang Qiu Purdue University. August 21, 2017 ECE 573 What is a compiler? Xiaokang Qiu Purdue University ECE 573 August 21, 2017 What is a compiler? What is a compiler? Traditionally: Program that analyzes and translates from a high level language (e.g.,

More information

Compiling Techniques

Compiling Techniques Lecture 1: Introduction 20 September 2016 Table of contents 1 2 3 Essential Facts Lecturer: (christophe.dubach@ed.ac.uk) Office hours: Thursdays 11am-12pm Textbook (not strictly required): Keith Cooper

More information

Polygraph: Automatically Generating Signatures for Polymorphic Worms

Polygraph: Automatically Generating Signatures for Polymorphic Worms Polygraph: Automatically Generating Signatures for Polymorphic Worms James Newsome Brad Karp Dawn Song Presented by: Jeffrey Kirby Overview Motivation Polygraph Signature Generation Algorithm Evaluation

More information

Language Reference Manual simplicity

Language Reference Manual simplicity Language Reference Manual simplicity Course: COMS S4115 Professor: Dr. Stephen Edwards TA: Graham Gobieski Date: July 20, 2016 Group members Rui Gu rg2970 Adam Hadar anh2130 Zachary Moffitt znm2104 Suzanna

More information

1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation

1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation 1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation Administrivia Lecturer: Kostis Sagonas (kostis@it.uu.se) Course home page: http://www.it.uu.se/edu/course/homepage/komp/h18

More information