Pattern Recognition and Applications Lab
Binary Analysis and Reverse Engineering
Ing. Davide Maiorca, Ph.D.
davide.maiorca@diee.unica.it
Computer Security, A.Y. 2017/2018
Department of Electrical and Electronic Engineering, University of Cagliari, Italy

Contents
Introduction
ELF Structure
  Introduction to readelf
  From ELF to Memory
Static Analysis
  Assembly X86 Basics and introduction to objdump
  Memory analysis during function calls
Dynamic Analysis
  Introduction to gdb
  Dynamic Analysis of Memory

Introduction

About Me
February 2012: Master of Science in Electronic Engineering, cum laude
November 2013 - April 2014: Visiting Student, Ruhr Universität Bochum (Prof. Dr. Thorsten Holz)
March 2016: Ph.D. (Doctor Europaeus), University of Cagliari; CLUSIT (Italian Association for Computer Security) Thesis Prize winner
Currently: Postdoctoral Fellow, University of Cagliari
Research Topics:
  Malware Analysis and Detection in Documents (PDF, Word, Flash, ...)
  Android Malware Analysis and Detection
  Adversarial Machine Learning
Other Activities:
  Mobile Forensics
  Program Committees for Conferences

Virtual Machine
Runs Ubuntu
You can find the files for this lecture directly on the VM
User: sicurezza1718
Password: security1718

Introduction - Binary Analysis
In Computer Security, we are often interested in finding anomalies in an executable (binary) program:
  Hidden actions
  Possible attacks or attempts to steal information
  Bugs
However, we often do not have the source code of the program
To analyze a binary, we must therefore resort to reverse engineering techniques
This is often the only way to understand anything about a program!
A very complex art!

Introduction - Reverse Engineering
When you program, you usually go from source code to binary files
Problem: are you really sure that the binary behaves exactly as you wanted?
You already know it is not that simple...
We use the word «bug» for a wrong or unexpected behavior of a program
Reverse Engineering: analyzing the code of an already compiled program to understand its behavior
This means that you are going to see how a program works in a very detailed way
Get ready to get your hands dirty!

Challenge!
Our lectures will be challenge-driven
The idea is to acquire the concepts that will allow you, in practice, to solve the challenge
In the virtual machine, you should find a file called sum_number
Try to run it...
You do not have the source code, so answering the question is not possible at the moment
However, thanks to what you learn in these lectures, you will be able to unveil many mysteries

ELF Structure

Creating an Executable (Linux)
Source (.c) -> Compiler -> Object File (.o) -> Linker -> ELF Executable

ELF
Executable and Linkable Format
Executable format for Linux (32 and 64 bit)
32 and 64 bit executables are NOT the same: memory management and specific instructions are different
The executable is composed of four parts:
  ELF Header: basic information about the file (e.g., architecture type, addresses, and section sizes)
  Section Header: describes the position of all sections of the executable (compulsory in .o files)
  Program Header: describes the executable sections that are loaded in memory during the program execution (segments; compulsory in the executable file)
  Data: the real file data

ELF (2)
Source (.c) -> Compiler -> Object File (.o) -> Linker -> ELF Executable
The Linker changes the addresses of the file sections depending on specific needs (relocation)

ELF Header
ELF files can be analyzed in practice with different tools
We start with readelf
  Can be found in any Linux distribution
  Provides information on the file structure
Let's start...
  readelf -h sum_number
Yes, that's the header of the executable!
  Magic number: four bytes that define the file type
  Entry point: VIRTUAL memory address that identifies the start of the program
    Does the program really begin with main? Your beliefs are going to change soon...
  The header also provides information about offsets and sizes of the ELF sections

Section Header
Let's dig deeper into the file
  readelf -S sum_number
35 sections (not all of them are important)!
We are considering already relocated sections (complete executable)
  .text: instructions of the process and read-only data
    Changes to these values -> Segmentation Fault!
    Read-only data are generally marked as such (on Linux, typically in the .rodata section)
  .data: initialized static data
  .bss: non-initialized static data
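
To make the header fields reported by readelf -h more concrete, here is a minimal sketch that reads the magic number, the entry point and the program header offset by hand. It assumes a 32-bit little-endian ELF file; the function name parse_elf32_header and the demo header (including the entry point value 0x08048350) are hypothetical and are not taken from sum_number.

```python
import struct

ELF_MAGIC = b"\x7fELF"  # the four magic bytes at the start of every ELF file

def parse_elf32_header(data: bytes) -> dict:
    """Read a few fields from the 52-byte header of a 32-bit little-endian ELF."""
    if data[:4] != ELF_MAGIC:
        raise ValueError("bad magic number: not an ELF file")
    ei_class, ei_data = data[4], data[5]  # 1 = 32 bit, 1 = little endian
    if (ei_class, ei_data) != (1, 1):
        raise ValueError("this sketch only handles 32-bit little-endian ELF")
    # The fields after the 16-byte identification block:
    # e_type, e_machine, e_version, e_entry, e_phoff, e_shoff
    e_type, e_machine, _version, e_entry, e_phoff, e_shoff = struct.unpack_from(
        "<HHIIII", data, 16
    )
    return {"type": e_type, "machine": e_machine,
            "entry": e_entry, "phoff": e_phoff, "shoff": e_shoff}

# Hypothetical header built by hand for the demo (NOT the real sum_number header).
demo_ident = ELF_MAGIC + bytes([1, 1, 1]) + bytes(9)           # 16 bytes
demo_header = demo_ident + struct.pack(
    "<HHIIII", 2, 3, 1, 0x08048350, 52, 0) + bytes(16)         # 52 bytes in total
info = parse_elf32_header(demo_header)
print(hex(info["entry"]), info["phoff"])
```

To try it on a real file, you could pass the first 52 bytes read from the file in binary mode; the values should then match what readelf -h prints.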

Section Types and Parameters
Types
  PROGBITS: sections that contain data that are actually used by the program
  NOTE: extra data that are not useful for the execution of the program
  SYMTAB/DYNSYM: sections that contain information about symbols. Symbols are names that represent data that are used by machine code
  STRTAB: section that contains strings that are used by the executable
  REL: relocation table
Other parameters
  Address: virtual memory address of the section
  Size: size of the section
  Offset: starting point inside the file
  Flags: execution flags
You can overlook the other parameters

Program Header
To access the program header, just type:
  readelf -l sum_number
The Program Header is composed of segments
Each segment is composed of a group of sections
LOAD type segments are loaded in memory when the program is run
In our example, segment 02 contains the .text section
  It contains the machine code (flags: Read/Execute) -> this segment is often named .text
Segment 03 contains the .data and .bss sections
  .data and .bss represent, respectively, initialized and uninitialized data (flags: Read/Write)
  When representing memory, sections .data and .bss are considered as separate segments
Careful with offsets!
  PHDR is the program header table (in our example, it starts from offset 52)
  LOAD starts from offset 0 (the start of the file), but only uses the first 0x6a4 bytes, although 0x1000 are loaded (i.e., 4096, the alignment value due to memory paging)
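
The relation between the 0x6a4 bytes used by the LOAD segment and the 0x1000 bytes that end up in memory is just a rounding up to the page size. A small sketch of that arithmetic, where the helper name round_up_to_page is an assumption made for illustration:

```python
PAGE_SIZE = 0x1000  # 4096 bytes, the alignment value shown by readelf -l

def round_up_to_page(size: int, page: int = PAGE_SIZE) -> int:
    """Return the smallest multiple of `page` that can hold `size` bytes."""
    return (size + page - 1) // page * page

used = 0x6a4                      # bytes actually used by the segment
loaded = round_up_to_page(used)   # bytes mapped in memory
print(hex(used), "->", hex(loaded))
```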

From ELF to Memory
[Figure: the executable file next to the process memory. In the executable, addresses increase towards the bottom; in memory, note how the address of .text is lower than that of .data]

Linux X86 Process in Memory - Structure
[Figure: layout of a process in memory, with the stackframe base marked]
Note that, in the picture, the .text section is in the lower part of the memory, but addresses grow upwards!
The stack always grows towards lower addresses (the opposite of the rest of the process)!

Linux X86 Process in Memory - Stack (2)
Heap: dynamically allocated memory
Stack: composed of frames
  Contains information about functions (parameters, return addresses, local variables, ...)
  Every time a function is called, a frame is allocated in memory
  Function arguments: the arguments that the function receives
  Return address: the address to which the function returns at its end
  Frame pointer: it is considered the «base» of the frame
  Local variables: variables that are defined in the function

Static Analysis

Disassembling an Executable
Until now, we have inspected the structure of the executable
Now it's time to understand what the executable does
We want to understand which instructions the processor really executes, without executing the file itself (static analysis)
This is called disassembling
To this end, we can use the tool objdump
  objdump -d sum_number
Static Analysis has a lot of advantages:
  It's usually very fast (especially if automated)
  It immediately provides a lot of information
  It avoids executing the file!

Assembly X86 Basics
Intel CISC: Complex Instruction Set Computer
  A lot of instructions! (but we will only use a small subset)
AT&T convention for instructions (opcode, source, destination)
  Used by Linux (Windows uses the Intel convention, where source and destination are reversed)
32 bit addressing
Little endian!
  LESS significant bytes go to LOWER addresses
  Example for word 0x90AB12CD:

  Memory Address   Saved Byte
  1003             90
  1002             AB
  1001             12
  1000             CD
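
The little-endian table above can be reproduced with a few lines of Python. This is only an illustrative sketch: the base address 1000 is the one used in the slide, and the variable names are arbitrary.

```python
import struct

word = 0x90AB12CD
raw = struct.pack("<I", word)   # "<I" = little-endian unsigned 32-bit word
base = 1000                     # address of the lowest byte, as in the table

# Map every address of the block to the byte saved there.
layout = {base + i: byte for i, byte in enumerate(raw)}

# Print from the highest address down, like the table in the slide.
for addr in sorted(layout, reverse=True):
    print(addr, f"{layout[addr]:02X}")
```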

Memory Addressing
Be VERY careful with little endianness
It can be confusing at times!
Whenever a pointer refers to a memory block, it ALWAYS points to the LOWEST part of the block
Consider two addresses: 0xbfff0000 and 0xbfff0004. In the block pointed to by the first address, you save the ARRAY 123 (the characters '1', '2', '3', which are represented, in hex, by 0x31 0x32 0x33), whilst in the second block you save the NUMBER 123 (represented by 0x7b)
EACH WORD IS ALWAYS READ BY CONSIDERING 4 BYTES FROM THE BLOCK START
  0xbfff0000: 0x00333231
  0xbfff0004: 0x0000007B

  Memory Address   Saved Byte
  0xbfff0007       0x00    (End of Second Block)
  0xbfff0006       0x00
  0xbfff0005       0x00
  0xbfff0004       0x7B    (Start of Second Block)
  0xbfff0003       0x00    (End of First Block)
  0xbfff0002       0x33
  0xbfff0001       0x32
  0xbfff0000       0x31    (Start of First Block)

Assembly X86 Registers and Instructions
8 «general purpose» registers + 1 that points to the next instruction (we are only going to consider the ones used by our example!)
  EAX, EDX: «accumulator» registers
  ESP: Stack Pointer
  EBP: pointer to the stackframe base (when a function is called)
  EIP: pointer to the next instruction
Basic Instructions
  PUSH: pushes a word onto the stack
  POP: removes a word from the stack
  MOV: moves a value from register to register or from register to memory
  MOVL: moves a 4 byte word from a register to memory (and vice versa)
  AND: logical AND operation
  ADD/SUB: adds/subtracts a value to/from a register
  LEAVE: completes some operations on the stack (see next slides)
  RET: same as return
  CALL: calls a function
  NOP: does not execute anything
    The operation xchg %ax, %ax can be considered similar to a NOP (but we will not add details)
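
Going back to the two-block example in the Memory Addressing slide, the following sketch rebuilds the eight bytes starting at 0xbfff0000 and reads each block as a 4-byte word. The bytearray standing in for memory and the helper read_word are assumptions made for illustration.

```python
import struct

BASE = 0xbfff0000
memory = bytearray(8)                      # models 0xbfff0000 .. 0xbfff0007

memory[0:4] = b"123\x00"                   # first block: the ARRAY 123 as characters
memory[4:8] = struct.pack("<I", 123)       # second block: the NUMBER 123

def read_word(address: int) -> int:
    """Read 4 bytes from the block start and interpret them as a little-endian word."""
    return struct.unpack_from("<I", memory, address - BASE)[0]

first = read_word(0xbfff0000)
second = read_word(0xbfff0004)
print(f"0xbfff0000: 0x{first:08x}")
print(f"0xbfff0004: 0x{second:08x}")
```

The characters appear "reversed" inside the first word only because the word is printed starting from its most significant byte.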

DISCLAIMER
REGISTER VALUES (ebp, esp, eax, ...) CAN VARY DEPENDING ON THE ARCHITECTURE AND ON THE OPERATING SYSTEM. IN THESE SLIDES YOU WILL ONLY FIND AN EXAMPLE TAKEN FROM AN EXECUTION OF THE FILE IN A VIRTUAL ENVIRONMENT

First Look
Let's have a look at the .text section retrieved with objdump
There are a lot of functions and instructions
A C program starts (in its source code) from the function main
Let's look for it!
What can we intuitively grasp from this function?
The first thing we can look for is other function calls
Let's then look for «call» instructions
We see that three functions are actively called: sum, printf, puts
  puts is like printf without formatting. It is often used to print newlines
Therefore, our program calls a function called sum and prints something, along with a newline!

Static Analysis of Code - main
Stackframe (main function): _start return address <- ESP
esp = 0xbffff07c
Starting situation
Each «block» is composed of 4 bytes
main is always called by a routine called _start (a compiled program does not start from main)
Before calling a new function, the caller pushes onto the stack the return address from which the program resumes its flow

Static Analysis of Code - main
Stackframe (main function): _start return address, old ebp <- ESP
Code so far: push %ebp
esp = 0xbffff078
The old stackframe base pointer is saved
PUSH FIRST MOVES THE POINTER BY 4 BYTES, THEN IT WRITES THE ELEMENT!
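
The rule "PUSH first moves the pointer, then writes" can be modelled with a tiny class. This is a sketch under stated assumptions: the class name Stack is invented, ESP is assumed to be 0xbffff07c right before the push (so that it becomes 0xbffff078 afterwards, the value that ebp receives in the next step), and the saved ebp value 0x0 is a placeholder.

```python
class Stack:
    """Toy model of a 32-bit x86 stack that grows towards lower addresses."""

    def __init__(self, esp: int):
        self.esp = esp
        self.mem = {}            # address -> 4-byte word stored there

    def push(self, value: int) -> None:
        self.esp -= 4            # first move the pointer by 4 bytes...
        self.mem[self.esp] = value   # ...then write the element

    def pop(self) -> int:
        value = self.mem[self.esp]   # first read the element...
        self.esp += 4                # ...then move the pointer back
        return value

stack = Stack(esp=0xbffff07c)    # assumed value of ESP on entry to main
stack.push(0x0)                  # push %ebp (placeholder for the old ebp)
print(hex(stack.esp))
```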

Static Analysis of Code - main
Stackframe (main function): _start return address, old ebp <- ESP, EBP
Code so far: push %ebp; mov %esp, %ebp
esp = 0xbffff078
ebp = 0xbffff078
Now the current stackframe base pointer points to the base of the main stackframe (before, the pointer was located in the _start function)

Static Analysis of Code - Memory Allocation
Stackframe (main function): _start return address, old ebp, ... <- ESP
Code so far: push %ebp; mov %esp, %ebp; and $0xfffffff0, %esp
esp = 0xbffff070
ebp = 0xbffff078
We are preparing the program to free some space to store local variables and parameters
This instruction moves ESP to a location whose address is a multiple of 16. Intel processors feature special instructions that require ESP to be at a memory address that is a multiple of 16 after the space for variables has been prepared. This preliminary instruction ensures that, when the space is completely ready, ESP points to an address that is a multiple of 16 (see next slide)
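
The effect of the alignment instruction is plain bit masking: clearing the four lowest bits of ESP rounds it down to a multiple of 16. A short sketch with the values of this example (the variable names are arbitrary):

```python
ALIGN_MASK = 0xfffffff0          # clears the 4 least significant bits

esp = 0xbffff078                 # value after push %ebp / mov %esp, %ebp
aligned = esp & ALIGN_MASK       # and $0xfffffff0, %esp
moved = esp - aligned            # how many bytes ESP moved down

print(hex(aligned), aligned % 16, moved)
```

Since ESP can only move down by 0 to 15 bytes here, the amount depends on the starting address; in this run it is 8 bytes.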

Static Analysis of Code - Memory Allocation
Stackframe (main function): _start return address, old ebp, ... (32 bytes) ... <- ESP
Code so far: push %ebp; mov %esp, %ebp; and $0xfffffff0, %esp; sub $0x20, %esp
esp = 0xbffff050
ebp = 0xbffff078
ESP moves 32 bytes (0x20) down through the stack in order to free some space for local variables, as well as for the parameters of another function
We are decreasing by a multiple of 16 (see previous slide)
PUSH+MOV+(AND)+SUB -> This is typically done when a function wants to call another one!

Static Analysis of Code - Function Call
Stackframe (main function): _start return address, old ebp, ..., 5 (at esp+4), (empty) <- ESP
Code so far: push %ebp; mov %esp, %ebp; and $0xfffffff0, %esp; sub $0x20, %esp; movl $0x5, 0x4(%esp)
esp = 0xbffff050
ebp = 0xbffff078
After some space has been freed, when a new function is about to be called the caller starts storing the parameters of the new function (they ALWAYS go at the end of the stackframe in reverse order -> the first parameter ALWAYS goes to the bottom)

Static Analysis of Code - Function Call
Stackframe (main function): _start return address, old ebp, ..., 5 (at esp+4), 4 <- ESP
Code so far: push %ebp; mov %esp, %ebp; and $0xfffffff0, %esp; sub $0x20, %esp; movl $0x5, 0x4(%esp); movl $0x4, (%esp)
esp = 0xbffff050
ebp = 0xbffff078
The second parameter to be stored (the first one in the C code) is written. The function takes two parameters whose values are 4 and 5 -> func(4, 5)

Static Analysis of Code - Function Call
Stackframe (main and sum functions): ..., 5, 4 (main stackframe), RETURN ADDRESS <- ESP (sum stackframe)
(NOTE: even if conceptually the passed parameters are part of the new function, it is common to consider the return address as the start of the new stackframe)
Code so far: push %ebp; mov %esp, %ebp; and $0xfffffff0, %esp; sub $0x20, %esp; movl $0x5, 0x4(%esp); movl $0x4, (%esp); call 804844d <sum>
esp = 0xbffff04c
ebp = 0xbffff078
Calling a new function means saving on the stack the return address (that is, the address of the next instruction of the main function) and jumping to the beginning of the new function
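
The parameter setup and the call can be replayed on a toy memory model. This is a sketch, not a real emulator: the dictionary mem stands in for the stack, and the return address 0x08048480 is the one that the GDB part of these slides reports for the call to sum.

```python
mem = {}                          # address -> 4-byte word
esp = 0xbffff050                  # after sub $0x20, %esp
RETURN_ADDRESS = 0x08048480       # instruction of main right after the call

mem[esp + 4] = 5                  # movl $0x5, 0x4(%esp)  -> second C argument
mem[esp] = 4                      # movl $0x4, (%esp)     -> first C argument

# call 804844d <sum>: push the return address, then jump to sum
esp -= 4
mem[esp] = RETURN_ADDRESS
eip = 0x0804844d

print(hex(esp), hex(mem[esp]), mem[esp + 4], mem[esp + 8])
```

Seen from the called function, the first argument now sits 4 bytes above the return address and the second one 8 bytes above it.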

Static Analysis of Code - Sum Function
Stackframe (main and sum functions): ..., 5, 4 (main stackframe), RETURN ADDRESS, saved ebp of main <- ESP (sum stackframe)
main(): ... movl $0x5, 0x4(%esp); movl $0x4, (%esp); call 804844d <sum>
sum(): push %ebp; mov %esp, %ebp; ...
esp = 0xbffff048
ebp = 0xbffff048
The new function always saves in its stackframe the ebp of the caller (in this case, the ebp of main is saved)

Static Analysis of Code - Sum Function
Stackframe (main and sum functions): ..., 5, 4 (main stackframe), RETURN ADDRESS <- ESP
main(): ... movl $0x5, 0x4(%esp); movl $0x4, (%esp); call 804844d <sum>
sum(): push %ebp; mov %esp, %ebp; ...; leave
esp = 0xbffff04c
ebp = 0xbffff078
leave completely cleans the stackframe of the function that is returning and restores the ebp of the caller

Static Analysis of Code - Sum Function
Stackframe (main and sum functions): ..., 5, 4 <- ESP (main stackframe)
main(): ... movl $0x5, 0x4(%esp); movl $0x4, (%esp); call 804844d <sum>
sum(): push %ebp; mov %esp, %ebp; ...; leave; ret
esp = 0xbffff050
ebp = 0xbffff078
ret loads the return address into eip (the next-instruction register) by using a POP instruction (the opposite of PUSH: it removes the element from the stack and moves ESP 4 bytes back)

Static Analysis of Code - Sum Function
Stackframe (main function): _start return address, old ebp, ..., 9 (at esp+0x1c), ..., 5, 4 <- ESP
Code so far: push %ebp; mov %esp, %ebp; and $0xfffffff0, %esp; sub $0x20, %esp; movl $0x5, 0x4(%esp); movl $0x4, (%esp); call 804844d <sum>; mov %eax, 0x1c(%esp)
esp = 0xbffff050
ebp = 0xbffff078
%eax contains the result of the sum function, which is stored at 0x1c(%esp), below the location pointed to by ebp. The location should be ebp-4, but the alignment instruction (the and) further moves everything by 8 bytes
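
The return sequence and the position of the result can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: leave is modelled as "mov %ebp, %esp" followed by "pop %ebp", ret as "pop into eip", and the memory contents are the ones of this example run.

```python
def leave(state: dict, mem: dict) -> None:
    """Model of leave: drop the current frame and restore the caller's ebp."""
    state["esp"] = state["ebp"]          # mov %ebp, %esp
    state["ebp"] = mem[state["esp"]]     # pop %ebp
    state["esp"] += 4

def ret(state: dict, mem: dict) -> None:
    """Model of ret: pop the return address into eip."""
    state["eip"] = mem[state["esp"]]
    state["esp"] += 4

# Situation inside sum, right after its prologue.
mem = {0xbffff048: 0xbffff078,           # saved ebp of main
       0xbffff04c: 0x08048480}           # return address into main
state = {"esp": 0xbffff048, "ebp": 0xbffff048, "eip": None}

leave(state, mem)
after_leave = dict(state)
ret(state, mem)

result_slot = state["esp"] + 0x1c        # where mov %eax, 0x1c(%esp) writes
gap = (state["ebp"] - 4) - result_slot   # distance from the "natural" ebp-4 slot
print(hex(state["esp"]), hex(state["eip"]), hex(result_slot), gap)
```

The gap of 8 bytes is exactly the amount by which the and instruction moved ESP in this run.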

Static Analysis of Code - Calling printf
Stackframe (main function): _start return address, old ebp, ..., 9 (at esp+0x1c), ..., 9 (at esp+4), 4 <- ESP
Code so far: push %ebp; mov %esp, %ebp; and $0xfffffff0, %esp; sub $0x20, %esp; movl $0x5, 0x4(%esp); movl $0x4, (%esp); call 804844d <sum>; mov %eax, 0x1c(%esp); mov %eax, 0x4(%esp)
esp = 0xbffff050
ebp = 0xbffff078
Load parameters for the next call

Static Analysis of Code - Calling printf
Stackframe (main function): _start return address, old ebp, ..., 9 (at esp+0x1c), ..., 9 (at esp+4), 0x8048540 <- ESP
Code so far: push %ebp; mov %esp, %ebp; and $0xfffffff0, %esp; sub $0x20, %esp; movl $0x5, 0x4(%esp); movl $0x4, (%esp); call 804844d <sum>; mov %eax, 0x1c(%esp); mov %eax, 0x4(%esp); movl $0x8048540, (%esp)
esp = 0xbffff050
ebp = 0xbffff078
This address refers to a string (strings are stored in dedicated sections of the file)

Static Analysis of Code - Calling printf
Stackframe (main function): _start return address, old ebp, ..., 9 (at esp+0x1c), ..., 9 (at esp+4), 0x08048540 <- ESP
Code so far: push %ebp; mov %esp, %ebp; and $0xfffffff0, %esp; sub $0x20, %esp; movl $0x5, 0x4(%esp); movl $0x4, (%esp); call 804844d <sum>; mov %eax, 0x1c(%esp); mov %eax, 0x4(%esp); movl $0x8048540, (%esp); call 8048310 <printf@plt>; ...
esp = 0xbffff050
ebp = 0xbffff078
Calls printf (which takes as parameters a string and a value to print)

Further notes
To fully understand the solution of the challenge, you also have to analyze the sum function
The principle is the same as for the main function (even simpler!)
Can you find the solution to the challenge by using static analysis?
Additional question: can you guess the solution of the challenge by only inspecting the main function?

Dynamic Analysis

Dynamic Analysis
Static analysis provides valuable information on the executable
However, this is often not enough!
  Some information is only available at runtime
  Understanding the register values by only using static analysis might be too complex!
  The executable may be obfuscated to complicate static analysis
To cope with these problems, dynamic analysis can be really helpful
Dynamic Analysis monitors the execution of the program, making it possible to analyze memory, instructions and the program flow at runtime

Introduction to GDB
GDB = GNU Debugger
It's the most popular open source program to analyze x86/x64 executables
Works on Linux, Windows, OS X
A lot of functionality!
  Lets you stop the execution of the program at a specific instruction (breakpoints)
  You can analyze memory and registers
  It also lets you set up conditional breakpoints, which are triggered by the occurrence of certain events
GDB helps to spot bugs in a program (or to exploit them to our advantage)

Using GDB
Let's go back to sum_number
  gdb sum_number
Type run to execute the program
From objdump, we see that the function sum starts at 0x804844d
With the x/i command you can see the instruction at a given address
  If you type x/i 0x804844d you can see the instruction at that address (the first one of sum)
Let's see what happens inside the function «sum»
The address of the first instruction is 0x0804844d
  break *0x0804844d
  DO NOT FORGET THE ASTERISK
  run
The execution is stopped BEFORE RUNNING THE INSTRUCTION
(Warning: the following slides continue from this point, so do not stop the execution)

X Command
Very powerful command!
x shows the content of the memory according to a certain type of representation (for instance, you can represent a sequence of bytes as instructions or keep them as bytes)
If you type x/ni you can see n instructions from a specific address...
If you type x/nb you can visualize n bytes starting from the lowest part of the block...
Example: x/4xb $ebp+4 (once sum has set up its own ebp) shows the return address of the function (an address inside the caller of «sum»), given by 0x08048480
  This is how it appears: 0x80 0x84 0x04 0x08 (THE LEAST SIGNIFICANT BYTE IS ON THE LEFT)
It is more effective to visualize data as words
  x/xw $ebp+4 shows the same result as a word, starting from the most significant byte
DO NOT USE x ALONE (without slash), as it will use the viewing style of its last call

Memory Analysis with GDB
We can obtain information on the loaded stackframes
Type frame
There is only one available frame (the one of «sum»)
Select the frame with f 0
  info f
  Shows all the information on the needed registers
  Shows the current ebp, the previous ebp and the return address (saved eip)
Let's see the register contents!
  info registers ebp
  Shows the value of ebp (THE NEW EBP HAS NOT BEEN SET YET, SO YOU ONLY SEE THE PREVIOUS ONE)
  info registers esp
  Current pointer to the stack
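
The difference between the byte view and the word view of the return address 0x08048480 is again just little endianness. A short sketch (the variable names are arbitrary) that reproduces both views:

```python
return_address = 0x08048480

# Byte view, lowest address first: what a byte-by-byte dump of the block shows.
byte_view = list(return_address.to_bytes(4, byteorder="little"))
print(" ".join(f"0x{b:02x}" for b in byte_view))

# Word view: the same 4 bytes read back as one little-endian word.
word_view = int.from_bytes(bytes(byte_view), byteorder="little")
print(f"0x{word_view:08x}")
```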

Memory Analysis with GDB
GDB can show the memory content
Let's see what we can find in the location pointed to by esp
  info registers esp returns «0xbffff04c»
  So type x/w 0xbffff04c
  The result is «0x08048480», which is the address of the instruction after the call to sum
esp is therefore pointing to the location that contains the return address
  This is correct, as we still have to push the new ebp onto the stack
Let's go on, instruction by instruction
  Use the command ni
  Use it now three times
Sometimes, it is possible to see instructions with some references to the original variables used by the programmer
  This is because the program contains «debugging» information

Memory Analysis with GDB (2)
The sum function sums two parameters and stores the result in a new variable
Sum parameters are loaded into the eax and edx registers
  «mov 0xc(%ebp), %eax», «mov 0x8(%ebp), %edx»
How to retrieve the values stored in eax and edx?
First way:
  info registers ebp -> 0xbffff048
  x/b 0xbffff048+(0xc) -> YOU CAN ALSO READ MEMORY AT SPECIFIC OFFSETS! -> You get 5 (the SECOND parameter)
  x/b 0xbffff048+(0x8) -> You get 4 (the FIRST parameter)
Second way:
  Type ni two times
  info registers eax, info registers edx
Third way:
  print a and print b, as the function takes a and b as input
  Works ONLY if debugging information is available...
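
The "first way" above amounts to reading memory at fixed offsets from ebp. The following sketch models it with a dictionary in place of the process memory; the addresses and values are the ones of this example run, and the final addition is an assumption about what sum does with the two registers, based on the slide text.

```python
ebp = 0xbffff048                 # ebp of sum, as reported by info registers ebp

# Toy memory around the frame of sum (address -> 4-byte word).
mem = {
    ebp: 0xbffff078,             # saved ebp of main
    ebp + 0x4: 0x08048480,       # return address
    ebp + 0x8: 4,                # FIRST parameter
    ebp + 0xc: 5,                # SECOND parameter
}

eax = mem[ebp + 0xc]             # mov 0xc(%ebp), %eax
edx = mem[ebp + 0x8]             # mov 0x8(%ebp), %edx
total = eax + edx                # assumed: sum adds the two registers

print(eax, edx, total)
```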

Summing up
You learnt many things from this lecture:
  Linux executable structure
  Loading Linux executables into memory
  Analyzing a Linux executable, by using the fundamentals of assembly x86 and two analysis techniques:
    Static analysis
    Dynamic analysis
Next question is: what if an attacker is able to exploit such information to their advantage?
Stay tuned for the next lesson!