A Unified Approach for Preventing Attacks Exploiting a Range of Software Vulnerabilities


Wei Xu, Sandeep Bhatkar, and R. Sekar
Stony Brook University, Stony Brook, NY
{weixu,sbhatkar,sekar}@cs.sunysb.edu

Abstract

Software implementation bugs are behind most security vulnerabilities reported today. Our analysis of CVE vulnerabilities in 2003 and 2004 indicates that 20% of them were classified as DOS attacks, 30% are due to design errors, and almost everything else is due to implementation errors. Among implementation errors, 84% are due to generalized injection vulnerabilities that allow an attacker to modify the values of security-sensitive variables using carefully crafted inputs to vulnerable programs. Attacks in this category include buffer overflows, format-string attacks, SQL and shell-code injection attacks, directory traversal attacks, and cross-site scripting. In this paper, we present a unified approach that can stop exploitation of these vulnerabilities. Our approach is based on a dynamic taint analysis technique for C-programs that is capable of tracking information flow at the level of individual bytes. By leveraging the low-level features of the C language, our technique achieves soundness even in the presence of memory errors such as buffer overflows. At the same time, we exploit the higher-level features of C to optimize taint-tracking to achieve low overheads of about 10% for server programs, and moderate overheads of between 25% and 60% for most CPU-intensive applications. Our experimental evaluation shows the effectiveness of the approach in stopping 9 real-world attacks that span the above categories.

1 Introduction

Figure 1 shows the breakdown of the CVE vulnerabilities in 2003 and 2004. Of the total 155 attacks reported in these two years, 20% are classified as DOS attacks, 30% are due to design errors, and 3% are due to configuration errors.
Of the remaining 47% of attacks, which are mostly due to software implementation errors, about 90% are due to a class of vulnerability that we refer to as generalized injection vulnerabilities. These vulnerabilities allow an attacker to modify the values of security-sensitive variables using carefully crafted inputs to a vulnerable program. The class of generalized injection vulnerabilities encompasses most commonly reported software security vulnerabilities, including buffer overflows, format-string attacks, integer overflows, SQL, shell-code, and command-injection attacks, directory traversal attacks, cross-site scripting and so on.

In this paper, we develop a unified approach that can tackle generalized injection vulnerabilities. This contrasts with previous approaches that tackled individual subclasses, such as buffer overflows [8, 1, 3, 21, 27, 5], format strings [7], command injection attacks [23, 22], etc. Even within each of these subclasses, our approach can detect attacks that were missed by previous techniques. For instance, [7] does not address format-string attacks involving the vsprintf series of functions, whereas they can be detected by our approach. In addition, we can detect the use of format-string attacks to read (rather than corrupt) victim process memory. Similarly, the best known techniques for memory error detection [3, 20] don't detect buffer overflows within a C-structure, but our technique will correctly mark the overflow data as tainted, and prevent its use in security-sensitive operations.

This research is supported by NSF grants CCR and CCR , and an ONR grant N

Our approach is based on a dynamic taint analysis technique for C-programs that is capable of tracking information flow at a fine level of granularity. By leveraging the low-level features


of the C language, which has been called a high-level assembly language, our technique achieves soundness even in the presence of memory errors such as buffer overflows. At the same time, we exploit the higher-level features of C to (a) optimize taint-tracking to achieve low overheads of about 10% for server programs, and moderate overheads of 25% to 60% for most other programs, and (b) support a limited form of reasoning about implicit information flows, thereby enabling us to perform accurate tainting in the presence of character encoding/decoding such as URL encoding.

Figure 1: Breakdown of CVE security vulnerabilities found in 2003 and 2004

We combine this fine-grained taint-tracking technique with a policy enforcement technique that operates on taint-annotated arguments to security-sensitive operations performed by a victim program. Using this framework, we have been able to detect (and indeed prevent) all of the above forms of exploits. Our approach is powerful enough to address attacks that involve multiple software components. It is applicable to all C-programs, as well as programs in languages that are interpreted by C-programs, such as PHP, bash and Perl.

By decoupling security checks from the internal structure of application programs, our approach delivers the following benefits. First, it enables application operators to remedy common security vulnerabilities even before they are discovered, and to do so without needing vendor cooperation. Second, the simplicity of our policy specifications provides more assurance on the correctness of security checks, as compared to what is possible when these checks are interwoven within the program text.

1.1 Motivating Examples

Memory Error Exploits. There are many different types of memory error exploits, such as stack-smashing, heap overflows and integer overflows.
All of them share the same basic characteristic: they exploit bounds-checking errors to overwrite security-critical data, typically pointer values, with attacker-provided data. As a result, they are detected in the same manner in our approach. We illustrate stack-smashing below:

    void vulnerable_fun(char *username) {
        char temp[512];
        char *p = temp;
        /* unbounded copy: nothing checks against sizeof(temp) */
        while ((*p++ = *username++))
            ;
        ...

This program copies untrusted data into a temporary array for processing, and fails to perform bounds checks. As a result, attacker-provided data (held in the variable username) will overwrite the return address on the stack. When the function returns, control will be transferred to a location determined by the attacker, typically an address within the array username that contains malicious binary code sent by the attacker.

Format String Vulnerabilities. Consider a program that uses the vfprintf series of functions. These functions are used by printf, as well as by user-defined printing functions, such as the log_message function below, that need to accept a variable number of parameters.

    void log_message(int level, char *fmt, ...) {
        va_list args;
        va_start(args, fmt);
        if (level > loglevel)
            vfprintf(logfp, fmt, args);
        ...

A correct call to this function is log_message(warn_level, "%s", input_data). But a programmer may incorrectly use log_message(warn_level, input_data). This works correctly if the input data is a plain string, but by inserting format directives an attacker can influence the behavior of the victim program. In the worst case, an attacker can use the %n directive to overwrite a return address and execute injected binary code. Alternatively, an attacker can read arbitrary data from the stack by using several %x directives.

SQL and Command Injection. SQL injection is one of the most common (and serious) attacks on web applications. These server-side applications communicate with a web browser client to collect data, which are subsequently used to construct an SQL query to be sent to a back-end database. Consider the following statement (written in PHP) for constructing an SQL query:

    $cmd = "SELECT price FROM products WHERE name = '" . $name . "'"

This query may be used in an e-commerce application to look up the price of an item. The variable name, whose value is given by an untrusted user, holds the product name.
Suppose that the user provides the following value:

    xyz'; UPDATE products SET price = 0 WHERE name = 'OneCaratDiamondRing

Note that the semicolon is used to separate multiple SQL commands. Thus, the query constructed by the program will first retrieve the price of some item called xyz, and then set the price of another item called OneCaratDiamondRing to zero, so that the user can then purchase it without having to pay for it.

Command injection refers to the general category of attacks where an attacker is able to control the value of some data that is interpreted as a command. Command injection vulnerabilities have also been reported in C-programs that use untrusted arguments to library functions such as system and popen.

Cross-Site Scripting (XSS). Consider an example of a bank that provides an ATM locator web page that customers can use to find the nearest ATM, based on their ZIP code. Typically, the web page contains a form that submits a query to the web site that looks as follows:

If the ZIP code is invalid, the web site typically returns an error message such as:

    <HTML> ZIP code not found: </HTML>

Note that in the above output from the web server, the user-supplied string is reproduced. This can be used by an attacker to construct an XSS attack as follows. The attacker may send HTML to an unsuspecting user that contains text such as:

    To claim your reward, please click <a href=" <script src= ></script>">here</a>

When the user clicks on this link, the request goes to the bank, which returns the following page:

    <HTML> ZIP code not found: <script src= ></script></html>

The victim's browser, on receiving this page, will download and run Javascript code from the attacker's web site. Since the above page was sent from the bank's web site, this script will have access to sensitive information stored on the victim's computer that pertains to the bank, such as cookies. Thus, the above attack will allow cookie information to be stolen. Since cookies are often used to store authentication data, stealing them can allow attackers to perform financial transactions using the victim's identity.

Directory Traversal. The goal of these attacks is to bypass directory-based access restrictions. For instance, a web server may use the following code to restrict CGI scripts to a directory named /cgi-bin.

    void check_access(char *file) {
        if ((strstr(file, "/cgi-bin/") == file) &&
            (strstr(file, "/../") == NULL)) {
            char *f = url_decode(file);
            /* allow access to f... */

The checking routine prevents the obvious approach of using ".." to evade the access restriction, e.g., using the file name /cgi-bin/../bin/sh. However, an attacker can still traverse outside this directory by using the name /cgi-bin/%2e%2e/bin/sh, since we obtain ".." as a result of URL decoding of %2e%2e. While it may seem obvious that the decoding step must precede the check, we point out that many directory traversal vulnerabilities, including the prominent ones involving IIS, have been based on similar programming errors.

1.2 Overview of Approach

Fine-grained taint analysis.
The first step in our approach is a source-to-source transformation of C-programs to perform fine-grained tracking of taint information at runtime. Taint originates at input functions, which need to be externally specified to the analyzer. Typically, there are only a few such functions, e.g., the read or receive system calls used by servers to read untrusted input over a network connection. The transformed program tracks taint information at the level of bytes in memory. By associating taint with bytes in memory, rather than with variables in a C-program, our approach ensures correctness of tainting in the presence of pointer aliasing, unrestricted casting of pointers, and memory errors. In the stack-smashing example, this analysis will mark the return address on the stack as tainted. In the format-string example, every byte of the fmt string, including any format directives within it, will be marked as tainted. In the SQL injection example, this analysis will mark every character in the SQL query as tainted, except for the prefix SELECT price FROM products WHERE name = ' and the closing single quote. Similarly, in the XSS example, all characters within the script tag will be marked as tainted. Finally, in the directory traversal example, all characters in the decoded file name f will be marked as tainted.
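To make the byte-level granularity concrete, the following is a minimal sketch of how the SQL query from the earlier example might carry one taint flag per character. It is not the paper's mechanism (which tags memory through an automatic source transformation); the tstring type and the ts_* and build_price_query helpers are hypothetical names introduced only for illustration, and taint is kept in a parallel array rather than in a global tag map.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: a string plus one shadow taint flag per byte. */
typedef struct {
    char          data[256];
    unsigned char taint[256];  /* taint[i] == 1: data[i] came from untrusted input */
    size_t        len;
} tstring;

static void ts_init(tstring *s) { memset(s, 0, sizeof *s); }

/* Append src and mark each appended byte with the given taint flag.
   Returns 0 on success, -1 if the buffer has no room. */
static int ts_append(tstring *s, const char *src, int tainted) {
    size_t n = strlen(src);
    if (s->len + n >= sizeof s->data) return -1;
    memcpy(s->data + s->len, src, n);
    memset(s->taint + s->len, tainted ? 1 : 0, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

/* Build the price query: program text is untainted, the user's name is tainted. */
static int build_price_query(tstring *q, const char *name) {
    ts_init(q);
    if (ts_append(q, "SELECT price FROM products WHERE name = '", 0)) return -1;
    if (ts_append(q, name, 1)) return -1;
    return ts_append(q, "'", 0);
}

/* Count the bytes currently marked as tainted. */
static size_t ts_count_tainted(const tstring *s) {
    size_t i, n = 0;
    for (i = 0; i < s->len; i++) n += s->taint[i];
    return n;
}
```

With a benign name such as xyz, only those three bytes are tainted. With the injected value from the example, the quote and semicolon supplied by the attacker fall inside the tainted region, while the quotes written by the program do not; that per-byte distinction is exactly what a policy can then inspect.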

Policy Checking. The second step in our approach is to check arguments to security-sensitive operations for unsafe content. This step is driven by a security policy specification. An interesting aspect of these policies is that they can refer to data values as well as their taint information. This enables us to construct simple yet accurate policies that can prevent exploits without raising false alarms. For instance, the example attacks described above can be prevented using the following policies. (See Section 3 for more detailed policies.)

Memory error exploits. The stack-smashing attack (and any other control-flow hijack attack) can be prevented by a policy that prevents any tainted pointer from being used as the target of a control transfer. Data attacks [6], such as those that corrupt the name of a file execve'd by a server, can be detected using a policy that prevents tainted data from being used as the first argument to this system call.

Format-string attacks. These can be prevented using a simple policy that prevents format directives (such as %s, %d, %g and %n) from being tainted. Since the analysis marked every byte of the fmt argument as tainted, the policy would be violated whenever the attacker includes format characters in their input.

SQL and command injection attacks can be stopped using a policy that prevents tainted metacharacters in an SQL query. (Recall that the analysis determined that metacharacters such as ; in the SQL query were tainted.)

Cross-site scripting attacks can be prevented by disallowing tainted script tags in web application output.

Directory traversal attacks can be stopped by disallowing a tainted ".." in the first argument of an execve call.

An important aspect of the above policies is that they do not restrict an application's use of security-sensitive operations. For instance, format directives aren't restricted if they originated from the program text.
Similarly, a web application can use arbitrary SQL commands, including metacharacters, if they originated within the application's program text. Another important aspect of these policies is their critical dependence on fine-grained taint information. If we used coarse-grained taint analysis, then the entire SQL query would be marked tainted, thereby raising an alarm for all benign uses of the web application.

1.3 Organization of the Paper

Section 2 describes our source-code transformation for fine-grained taint-tracking. Our policy language, together with sample policies, is described in Section 3. Section 4 discusses various issues in implementing the approach, and outlines our implementation. Section 5 provides an experimental evaluation of the approach. Related work is discussed in Section 6. Finally, concluding remarks appear in Section 7.

2 Transformation for Fine-grained Taint Tracking

2.1 Marking trusted and untrusted inputs

In our transformation framework, marking actions can be associated with any function, and are specified using snippets of C-code. Consider the following example:

    read(fd, buf, size, rv) -->
        if (isnetworkendpoint(fd))
            taintbuf(buf, rv)
        else
            untaintbuf(buf, rv)

This specification states that when the read function returns, the buffer should be tainted if the input is from the network. Otherwise, it should be untainted. An extra argument rv is included in the read function prototype; it will contain the return value of the function. The C-code snippet uses support functions (or macros) to determine if a file descriptor is associated with a network, and to perform the actual taint marking. Marking actions are needed for every external input function used by an application. If all code, including libraries, is transformed, then marking needs to be specified only for system calls that return inputs.

2.2 Runtime Representation of Taint Information

Our technique tracks taint information at the level of bytes in memory. This is necessary to ensure accurate taint-tracking for type-unsafe languages such as C. A one-bit taint tag is used for each byte of memory, with 0 representing the absence of taint, and 1 representing the presence of taint. A bit-array tagmap stores taint information. The taint bit associated with a byte at address a is given by tagmap[a]. The tagmap array is allocated statically, so it takes up 1/8th of the total address space. Actual memory usage for tagmap will be close to 1/8th of the actual memory usage of the program.

2.3 Basic Transformation

The source-code transformation described in this section is designed to track explicit information flows that take place through assignments and arithmetic and bit operations. Implicit flows that take place through comparisons and conditionals are addressed in Section 2.5. At a high level, explicit flow is simple to understand: the result of an arithmetic/bit expression is tainted whenever any of the variables in the expression is tainted, and a variable x is tainted by an assignment x = e whenever e is tainted.
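The tag map and the explicit-flow rule can be sketched on a toy scale as follows. This is only an illustration under stated assumptions: the real tagmap is indexed by actual addresses and covers the whole address space, whereas here a 1 KB arena stands in for memory and offsets stand in for addresses. The names arena, tag_get, tag_set and assign_sum are hypothetical, and treating a multi-byte value as tainted when any of its bytes is tainted is a simplification chosen for the sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARENA_SIZE 1024u  /* assumption: a toy 1 KB "address space" */

static unsigned char arena[ARENA_SIZE];       /* program data, addressed by offset */
static unsigned char tagmap[ARENA_SIZE / 8];  /* one taint bit per arena byte */

/* Read the taint bit of the byte at offset a. */
static int tag_get_byte(size_t a) {
    return (tagmap[a >> 3] >> (a & 7u)) & 1;
}

/* Set or clear the taint bit of the byte at offset a. */
static void tag_set_byte(size_t a, int tainted) {
    unsigned char mask = (unsigned char)(1u << (a & 7u));
    if (tainted) tagmap[a >> 3] |= mask;
    else         tagmap[a >> 3] &= (unsigned char)~mask;
}

/* Taint of an n-byte object: 1 if any of its bytes is tainted (simplification). */
static int tag_get(size_t a, size_t n) {
    size_t i;
    int t = 0;
    for (i = 0; i < n; i++) t |= tag_get_byte(a + i);
    return t;
}

/* Mark all n bytes of an object as tainted or untainted. */
static void tag_set(size_t a, size_t n, int tainted) {
    size_t i;
    for (i = 0; i < n; i++) tag_set_byte(a + i, tainted);
}

/* Transformed form of "z = x + y" for 4-byte values stored at arena offsets:
   the data assignment is followed by the matching taint assignment. */
static void assign_sum(size_t z, size_t x, size_t y) {
    uint32_t vx, vy, vz;
    memcpy(&vx, arena + x, sizeof vx);
    memcpy(&vy, arena + y, sizeof vy);
    vz = vx + vy;
    memcpy(arena + z, &vz, sizeof vz);
    tag_set(z, sizeof vz, tag_get(x, sizeof vx) | tag_get(y, sizeof vy));
}
```

Because the tag bits are keyed by location rather than by variable name, a write through an aliased pointer or past the end of a buffer updates the tags of whatever bytes it actually touches, which is the property the byte-level design is meant to provide.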
Specifically, Figure 2 shows the expression T(E) that computes the taint information associated with an expression E. Figure 3 shows, for each type of statement S, the corresponding transformed version Trans(S), which makes use of the definition of T(E) in Figure 2. The transformation rules shown in the table are self-explanatory for the most part, so we explain only the rules for transforming functions. Logically, the taint information for each function parameter a is provided through an additional parameter ta. At the beginning of the function body, the value of ta is copied into the tag bits associated with a. This step is necessary since the actual parameter-passing actions, such as pushing values onto the stack, happen at a lower level and aren't visible at the level of the source language. A similar mechanism is needed to copy the tag bits associated with a return value into the caller. In our implementation, we don't use additional parameters since doing so introduces additional complexity in dealing with untransformed libraries. Instead, we use a separate stack to pass taint arguments.
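The idea of passing taint on a separate stack, rather than through extra parameters, might look roughly like the sketch below. The taint_push and taint_pop helpers, the fixed stack depth, and the example function twice are all hypothetical; the paper does not show its actual stack layout or calling protocol.

```c
#include <stddef.h>

#define TAINT_STACK_MAX 64  /* assumption: small fixed depth for the sketch */

static unsigned char taint_stack[TAINT_STACK_MAX];
static size_t taint_sp = 0;

/* Push one taint flag; silently ignored if the stack is full. */
static void taint_push(int t) {
    if (taint_sp < TAINT_STACK_MAX)
        taint_stack[taint_sp++] = (unsigned char)(t != 0);
}

/* Pop one taint flag; an empty stack yields "untainted". */
static int taint_pop(void) {
    return taint_sp > 0 ? taint_stack[--taint_sp] : 0;
}

/* Original function: int twice(int a) { return a + a; }
   Transformed: the C signature is unchanged, so untransformed callers
   still link; the parameter's taint arrives on the taint stack. */
static int twice_transformed(int a) {
    int ta = taint_pop();   /* taint of parameter a */
    int result = a + a;
    taint_push(ta | ta);    /* taint of the return value */
    return result;
}

/* Caller side of "v = twice(e)": push T(e), call, pop the result's taint. */
static int call_twice(int e, int te, int *tv) {
    int v;
    taint_push(te);
    v = twice_transformed(e);
    *tv = taint_pop();
    return v;
}
```

Keeping the original signature is the design point here: a library that was never transformed can still call or be called by transformed code, at the cost of needing some convention for what happens when one side does not push or pop.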

    E            T(E)                                   Comment
    c            0                                      Constants are untainted
    v            tag(&v, sizeof(v))                     tag(a, n) refers to n bits starting at tagmap[a]
    &E           0                                      An address is always untainted
    *E           tag(E, sizeof(*E))
    (cast)E      T(E)                                   Type casts don't change taint
    op(E)        T(E) for arithmetic/bit op,
                 0 otherwise
    E1 op E2     T(E1) || T(E2) for arithmetic/bit op,
                 0 otherwise

Figure 2: Definition of Taint for Expressions

    S                    Trans(S)
    v = E                v = E; tag(&v, sizeof(v)) = T(E);
    S1; S2               Trans(S1); Trans(S2)
    if (E) S1 else S2    if (E) Trans(S1) else Trans(S2)
    while (E) S          while (E) Trans(S)
    return E             return (E, T(E))
    f(a) { S }           f(a, ta) { tag(&a, sizeof(a)) = ta; Trans(S) }
    v = f(E)             (v, tag(&v, sizeof(v))) = f(E, T(E))
    v = (*f)(E)          (v, tag(&v, sizeof(v))) = (*f)(E, T(E))

Figure 3: Transformation of Statements

2.4 Optimizations

The basic transformation described above is effective, but introduces high overheads, slowing down some programs by a factor of 5. To improve performance, we have developed several optimization techniques that we summarize below. Together, these optimizations have reduced the overheads for most programs to below 50%.

Low-level Optimizations.

Use of 2-bit taint values. In the implementation, accessing taint bits requires several bit-masking and unmasking operations, which degrades performance significantly. We observed that by using 2-bit taint values, the taint value for integer variables will be contained within a single byte, thereby eliminating the need for these masking operations. This approach does increase the memory requirements for tagmap, but on the other hand, it opens up the possibility of tracking richer taint information in the future.

Use of mmap for tagmap allocation. Our original approach used a global variable to implement tagmap. As the pages of this large array (1GB) are loaded into memory they are initialized, and this introduces significant overheads.
Note that this initialization is unnecessary: programs will initialize variables before use, and our transformation will result in the tag bits getting initialized at this time. By using mmap to allocate this memory, we eliminated this overhead. Unfortunately, this introduces an additional level of indirection in access to tagmap, since the base address is no longer a constant. To address this problem, our transformation fixes the location at compile time (this location can be changed using a compile-time flag). Note that on most operating systems, the memory map used by a process is typically predictable, so it is feasible to do this.

Intra-procedural Optimizations.

Use of local taint tag variables when possible. To further reduce the overheads for accessing tagmap, we have implemented an optimization that uses local variables to maintain taint tags. Such an approach is sound in the absence of aliasing of local variables. Hence it is applied only to those functions that do not explicitly compute the addresses of local variables. Since local variables are typically used much more frequently than global variables in most programs, this optimization yields significant performance improvement in practice.

Intra-procedural dependency analysis. To further improve performance, we perform a local dependency analysis to determine whether a local variable can ever become tainted. Note that a local variable can become tainted only if it is involved in an assignment with a global variable, a procedure parameter, or another local variable that can possibly become tainted.

Global optimizations. We are currently implementing a global optimization that is based on interprocedural context-sensitive (but flow-insensitive) analysis, which will be described in the final version of the paper.

2.5 Support for Implicit Information Flow

Implicit information flow occurs when the values of certain variables are related by virtue of program logic, even though there are no assignments between them. A classic example is given by the code snippet [24]:

    y = 0;
    if (x == 1) y = 1;

Even though there are no assignments involving x and y, their values are always the same. The need for tracking such implicit flows has long been recognized. [11] formalized implicit flows using a notion of noninterference. Several recent research efforts [16, 28, 18] have developed techniques based on this concept. Noninterference is a very powerful property, and can capture even the least bit of correlation between sensitive data and other data. For instance, in the code:

    if (x > 10000) error = true;
    if (!error) { y = "/bin/ls"; execve(y); }

there is an implicit flow from x to error, and then to y. Hence, a policy that forbids tainted data from being used as an execve argument would be violated by this code, leading us to the conclusion that noninterference-based taint analysis will lead to excessive false alarms in our application. In the context of the kinds of attacks we are addressing, it is clear that attackers need much more control over the value of y than the minimal relationship that exists in the code above. Thus, it is more appropriate to track explicit flows.
Nevertheless, there is an important case that occurs frequently in web applications that requires handling of implicit flows. Often, data supplied to a web application is encoded, and needs to be decoded by the application before use. The decoded data depends on the input, and must be marked tainted. We use a pattern-based approach for supporting this form of implicit flow. Currently, two basic patterns are supported.

Translation tables. Decoding is sometimes implemented using a table lookup, e.g., y = translation_tab[x]; where translation_tab is an array and x is a byte of input. To handle this case, we modify the basic transformation so that the result of an array access is marked as tainted whenever the subscript is tainted. This pattern successfully handles the use of translation tables in the PHP interpreter.

Decoding using if-then-else/switch. Sometimes, decoding is implemented using a statement of the form: if (x == '+') y = ' '; (Such code is often used for URL-decoding.) Clearly, the value of y is determined entirely by the value of x. More generally, switch statements could be used to translate between

multiple characters. Our transformation handles them in the same way as a series of if-then-else statements, for which the following pattern is applied: if (x == E) y = E'. If E and E' are constant-valued, we add an assignment tag(y) = tag(x).

    Attack type: Control-flow hijack
    Policy:      jmp(addr) : taintedword(addr) --> term()
    Comment:     Tainted values cannot be used as a target of control transfer

    Attack type: Format string
    Policy:      PRINTF_FUNCTION(fmt) = printf(fmt, ) vsprintf(, fmt, ) ...
                 PRINTF_FUNCTION(fmt) : tainted(match(fmt, FORMAT_SPECIFIER)) --> reject()
    Comment:     Format specifiers (e.g. %n, %x) should not be tainted

    Attack type: Directory traversal
    Policy:      FILE_FUNCTION(path) = execv(path, ) open(path, ) realpath(path, ) ...
                 FILE_FUNCTION(path) : tainted(match(path, DIR_TRAVERSAL_MODIFIER))
                                       && escapeprefixdir(path) --> reject()
    Comment:     path should not contain directory traversal strings (e.g. /../),
                 or the real path of path should not go outside the base directory of path

    Attack type: Cross site scripting
    Policy:      HTML_PRINT_FUNCTION(str) : tainted(match(str, SCRIPT_TAG)) --> reject()
    Comment:     No tainted script tags (e.g. script, object) should be output to HTML

    Attack type: SQL injection
    Policy:      SQL_QUERY_FUNCTION(query) : tainted(match(query, SQL_METACHAR)) --> reject()
    Comment:     SQL query string should not contain tainted meta-chars

    Attack type: Shell command injection
    Policy:      SHELL_COMMAND_FUNCTION(cmd) : tainted(match(cmd, SHELL_METACHAR)) --> reject()
    Comment:     cmd argument of system or popen shouldn't contain tainted meta-chars

Figure 4: Security policies for attack detection

2.6 Coping with Untransformed Libraries

Ideally, all the libraries used by an application will be transformed using our technique so as to enable accurate taint-tracking. However, in practice, source code may not be available for some libraries, thereby necessitating work-arounds. In our approach, such libraries can be handled, but naturally, there is no way to compute the flow of taint information due to the execution of such functions.
Users of our system have to supply summarization functions for each external function used by the application; otherwise a compile-time error is reported. Summarization functions are specified in the same manner as marking functions, and are invoked at the return of each call to the external functions. These functions are often easy to write for simple APIs. We illustrate them with the example of memcpy. This example uses a support function to copy taint bits associated with a source buffer to taint bits associated with the destination.

    memcpy(dest, src, n) --> taint_copy_buffer(*dest, *src, *n);

3 Specifying Security Policies

To detect generalized injection attacks, arguments to security-sensitive operations are checked against security policies, which can refer to data values as well as their taint information. These policies are specified using rules of the form

    Where : Condition --> Action

Where defines the function call where the policy is to be applied. Macros, defined using a notation similar to regular expressions, can be used to combine rules for multiple functions.

Condition defines the security checks, which can use support functions written in C. We have used the following functions in our policies:

taintedword(addr): returns whether or not the word stored at the address addr is tainted.

tainted(s_vec): returns whether or not any string element of the string vector s_vec is tainted.

match(s, pat): returns all the sub-strings in s that match the regular expression pattern pat.

Action defines the action to be taken if the condition is satisfied. For example, term() terminates the program execution, while reject() denies the request and returns with an error.

Figure 4 shows examples of a few simple yet effective policies to detect various attacks. For the control-flow hijack policy (which can detect stack-smashing and most other memory error exploits), we use a special keyword jmp in the place of a function name, as we need some special way to capture low-level control-flow transfers that aren't exposed directly in the C language.

Depending on the application, policies can be further refined. For example, the SQL injection policy shown in Figure 4 cannot detect some SQL injection attacks that use only SQL keywords (but not meta-characters). A refined policy would be one that tokenizes the SQL query, and then checks to ensure that no two tokens appear within a tainted substring. This policy is based on the assumption that no two consecutive parameters in a SQL query (or a parameter followed by a keyword or a metacharacter) should come from the user. In the same manner, we can use a tokenizer-based policy to detect shell command injection attacks that taint the command components in the string argument passed to a shell.

4 Implementation

We have implemented the program transformation technique described in the previous section. The transformer is written in Objective Caml and uses the CIL [17] toolkit as the front end to manipulate C constructs. The marking specifications, as well as the security policies, are inserted into the code by the transformer.
In the future, we anticipate that these specifications will be decoupled from the transformation, and will be able to operate on binaries using techniques such as library interposition. This would enable a site administrator to alter, refine or customize her notions of trustworthy input and dangerous arguments without having access to source code.

Our source-code transformer is currently unable to handle glibc due to its complexity, and its use of low-level code in some parts. As such, we relied on writing summarization functions for the standard C-library functions that were used in our test programs. We wrote about forty such summarization functions.

Whereas the transformation itself remains identical across all applications, including the PHP and Bash interpreters, the marking specifications will vary across different classes of applications. Below, we summarize the marking phases for the application classes we have studied so far.

Network Servers Written in C. For network servers, we need to mark all external input consumed by the program. Specifically, in the case of the samba and wu-ftpd servers, we specified taint marking for functions such as read, fread, getc, recv, and recvfrom. A runtime check is made to determine if these inputs are from the network or from a file, and the contents are marked as tainted in the former case.

Bash. Our experiments involved the use of bash to write CGI scripts. Communication between a web server and a CGI script takes place through a well-defined interface. This interface delivers untrusted data (from a remote client) through the standard input and CGI environment variables such as QUERY_STRING, HTTP_COOKIE and HTTP_REFERER. The marking functions are set up to taint the result of getenv on these environment variables, as well as the results of reading from stdin.

PHP. PHP is a scripting language for generating dynamic HTML pages. In a typical deployment, the PHP scripting engine is an Apache module that is loaded and run by the Apache web server. It communicates with the web server using a well-defined Apache Module API. This API defines the manner in which untrusted data from a web browser is sent to the PHP script, and how the generated pages are to be sent back to the browser. Once these input/output functions are identified, the marking phase is relatively simple: we mark all data received from a browser as tainted. Data that is generated directly by the Apache server isn't marked.

An alternative approach is to transform both the Apache server and the PHP interpreter. The marking phase would be somewhat simplified in this case, as we don't need to manually identify data generated by the web server. Instead, we can simply mark any data read (from the network) by Apache as tainted. On the other hand, our current approach shows the feasibility of doing accurate taint analysis even if the source code to the web server weren't available.

5 Experimental Evaluation

The primary goal of our evaluation is to determine the effectiveness of the approach in stopping generalized injection attacks (Section 5.1), and its performance (Section 5.4). False positives weren't evaluated rigorously, but we do note that none were observed in our evaluation.
Based on the policies outlined in Section 3, we argue that false positives and false negatives due to policies are unlikely.

5.1 Attack Detection

Figure 5 shows the attacks used in our experiments. These attacks were selected so as to cover all of the attack categories we have discussed, and to span multiple programming languages. Wherever possible, we selected attacks on widely-used applications, since obvious security vulnerabilities would have been fixed in such applications. Thus, we are more likely to test with more sophisticated attacks.

To test our technique, we first downloaded the software packages shown in Figure 5. We downloaded the exploit code for each of the attacks, and verified that they worked as expected. The verification step used transformed C-programs and interpreters, but with the policies disabled. Then we enabled the policies, and verified that each one of the attacks was prevented by these policies. We also verified that the policies did not cause any false alarms under normal operation.

Network Servers in C. We studied three network servers:

wu-ftpd versions and lower have a format string vulnerability in the SITE EXEC command that allows arbitrary code execution. The attack is stopped by the policy that the format specifier %n in a format string should not be tainted.

samba versions and lower have a stack-smashing vulnerability in processing a type of request called transaction 2 open. The attack violates the policy that the targets of control transfers be given by untainted data, and hence the attack is stopped.
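The wu-ftpd policy, that a %n specifier in a format string should not be tainted, can be sketched as a scan over the format string and its byte-level taint tags. The function name tainted_percent_n and the parallel tags array are assumptions made for illustration; the set of flag, width, and length characters skipped below is a simplification of the full printf grammar.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if fmt contains a %n directive whose '%' or 'n' byte is
 * tainted, 0 otherwise. tags[i] is the taint tag of fmt[i].
 */
static int tainted_percent_n(const char *fmt, const unsigned char *tags)
{
    size_t i = 0;
    while (fmt[i] != '\0') {
        if (fmt[i] != '%') {
            i++;
            continue;
        }
        size_t start = i++;               /* position of the '%' */
        if (fmt[i] == '%') {              /* literal "%%" */
            i++;
            continue;
        }
        /* Skip flags, width, precision, positional and length characters. */
        while (fmt[i] != '\0' &&
               strchr("-+ #0123456789.*$hlLqjzt", fmt[i]) != NULL)
            i++;
        if (fmt[i] == 'n' && (tags[start] || tags[i]))
            return 1;                     /* policy violation */
        if (fmt[i] != '\0')
            i++;
    }
    return 0;
}
```

An untainted %n written by the programmer passes the check, whereas the same directive arriving inside attacker-supplied data is flagged, which is the distinction the policy relies on.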

CVE#  Program       Language  Attack type           Attack description
CAN   samba         C         Stack smashing        Buffer overflow in call_trans2open function
CVE   wu-ftpd       C         Format string         via SITE EXEC command
CAN   pico server   C         Directory traversal   Command execution via URL with multiple leading / characters and ..
CAN   phpbb         PHP       SQL injection         via topic_id parameter
CAN   phpbb         PHP       Directory traversal   Delete arbitrary file via .. sequences in avatarselect parameter
CAN   SquirrelMail  PHP       Cross site scripting  Insert script via the mailbox parameter in read_body.php
CAN   SquirrelMail  PHP       Command injection     via meta-character in To: field
CAN   PHP XML-RPC   PHP, C    Command injection     Eval injection
CVE   nph-test-cgi  BASH      Shell meta-character  using * in $QUERY_STRING expansion

Figure 5: Attacks used in effectiveness evaluation

Pico HTTP Server (pserv) versions 3.2 and lower have a directory traversal vulnerability. The web server does include checks for the presence of .. in the file name, but allows them as long as their use doesn't go outside the CGI-bin directory. To determine this, pserv scans the file name left to right, decrementing a counter for each occurrence of .. and incrementing it for each occurrence of the / character. If the counter goes to zero, then access is disallowed. Unfortunately, a file name such as /cgi-bin////../../bin/sh satisfies this check, but has the effect of going outside the /cgi-bin directory. This attack is stopped by the directory traversal policy shown in Section 3.

Web Applications in PHP.

phpbb2 SQL injection: phpbb, a popular electronic bulletin board application, has an SQL injection vulnerability (in version 2.0.5) that allows an attacker to steal the MD5 password hash of another user. The vulnerable code is:

$sql="select p.post_id FROM... WHERE... AND p.topic_id = $topic_id AND..."

Normally, the user-supplied value for the variable topic_id should be a number, and in that case, the above query works as expected.
Suppose that the attacker provides the following value:

-1 UNION SELECT ord(substring(user_password,5,1)) FROM phpbb_users WHERE userid=3/*

This converts the SQL query into a union of two SELECT statements, and comments out (using /*) the remaining part of the original query. The first SELECT returns an empty set since topic_id is set to -1. As a result, the query result equals the value of the SELECT statement injected by the attacker, which returns the 5th byte in the MD5 hash of the bulletin board user with the userid of 3. By repeating this attack with different values for the second parameter of substring, the attacker can obtain the entire MD5 password hash of another user. Our technique detects this attack based on the policy for SQL injection described in Section 3.
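To make the role of byte-level taint in this attack concrete, the following sketch builds a query from an untainted template and a tainted parameter, then checks the result. The rule used here is a simplifying assumption and not the Section 3 policy itself: no SQL keyword and no comment opener may contain tainted bytes. The names append_tagged and sql_policy_violated, and the short keyword list, are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Appends src to q at offset pos (pos < cap assumed), tagging every
 * appended byte with taint. Returns the new length of the query. */
static size_t append_tagged(char *q, unsigned char *tags, size_t pos,
                            size_t cap, const char *src, unsigned char taint)
{
    size_t n = strlen(src);
    if (pos + n >= cap)
        n = cap - 1 - pos;            /* truncate, keep room for NUL */
    memcpy(q + pos, src, n);
    memset(tags + pos, taint, n);
    q[pos + n] = '\0';
    return pos + n;
}

/* Assumed rule: a keyword or a comment opener must not be tainted. */
static int sql_policy_violated(const char *q, const unsigned char *tags)
{
    static const char *const keywords[] = {
        "union", "select", "from", "where", "or", "and"
    };
    size_t i = 0;
    while (q[i] != '\0') {
        if (q[i] == '/' && q[i + 1] == '*' && (tags[i] || tags[i + 1]))
            return 1;                 /* tainted comment opener */
        if (!isalpha((unsigned char)q[i])) {
            i++;
            continue;
        }
        size_t start = i;
        int tainted = 0;
        while (isalnum((unsigned char)q[i]) || q[i] == '_') {
            tainted |= tags[i];
            i++;
        }
        for (size_t k = 0;
             tainted && k < sizeof keywords / sizeof keywords[0]; k++)
            if (strlen(keywords[k]) == i - start &&
                strncasecmp(q + start, keywords[k], i - start) == 0)
                return 1;             /* tainted SQL keyword */
    }
    return 0;
}
```

With a numeric topic_id the only tainted bytes are digits and the query is accepted; with the injected value above, the UNION keyword carries taint tags and the query is rejected before it reaches the database.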

SquirrelMail cross-site scripting: SquirrelMail is a popular web-based email client. Version contains multiple cross-site scripting vulnerabilities, e.g., read_body.php directly outputs values of user-controlled variables such as mailbox when generating response HTML pages. The attack is stopped by the cross-site scripting policy in Section 3.

SquirrelMail command injection: SquirrelMail (Version 1.4.0) constructs a command for encrypting using the following statement in the function gpg_encrypt in the GPG plugin 1.1:

$command.= " -r $send_to_list 2>&1";

The variable send_to_list should contain the recipient name in the To field, which is extracted using the parseaddress function of the Rfc822Header object in SquirrelMail. However, due to a bug in this function, some malformed entries in the To field are returned without checking for proper format. In particular, by entering recipient ; cmd ; into this field, the attacker can execute any arbitrary command cmd with the privilege of the web server. By tracking taint information and applying a policy that prohibits tainted shell metacharacters in the first argument to the popen function, our technique stops this attack.

phpbb directory traversal: A vulnerability exists in phpbb which, when the gallery avatar feature is enabled, allows remote attackers to delete arbitrary files using directory traversal. This vulnerability can be exploited by a two-step attack. In the first step, the attacker saves the file name, which contains .. characters, into the SQL database. In the second step, the file name is retrieved from the database and used in a command. To detect this attack, it is necessary to record taint information for data stored in the database, which is quite involved. We took a shortcut, and marked all data retrieved from the database as tainted. (Alternatively, we could have marked only those fields updated by the user as tainted.) This enabled the attack to be detected using the directory traversal policy.
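The popen policy used against the SquirrelMail command injection can be sketched as a check on the command string and its taint tags. The function name tainted_shell_meta and the particular set of metacharacters are assumptions of this sketch, not the exact policy of Section 3.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if cmd contains a shell metacharacter whose byte is tainted,
 * 0 otherwise. tags[i] is the taint tag of cmd[i]. The metacharacter set
 * below is an illustrative assumption.
 */
static int tainted_shell_meta(const char *cmd, const unsigned char *tags)
{
    static const char meta[] = ";|&`$<>(){}*?\n";
    for (size_t i = 0; cmd[i] != '\0'; i++)
        if (tags[i] && strchr(meta, cmd[i]) != NULL)
            return 1;
    return 0;
}
```

The redirection 2>&1 appended by the application itself is untainted and is therefore allowed, while a ; that arrives through the To field is tainted and causes the call to be rejected. A policy that looked only at the characters, without taint, could not make this distinction.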
phpxmlrpc/expat command injection: phpxmlrpc is a package written in PHP to support the implementation of PHP clients and servers that communicate using the XML-RPC protocol. It uses the expat XML parser for processing XML. phpxmlrpc versions 1.0 and earlier have a remote command injection vulnerability that enables a client to send a malicious XML request, causing the server to execute arbitrary code. Our command injection policy stops this attack. We selected this application for our evaluation because it uses a substantial external library, namely the expat library for XML processing. The API provided by this library is quite complex, so it isn't practical to write summarization functions. Instead, we transformed the expat library as well, so that taint tracking is performed within the library. This enabled the attack to be detected.

Bash Application. nph-test-cgi is a CGI script that was included by default with Apache web server versions and earlier, as well as with some versions of NCSA web servers. It prints out the values of the environment variables available to a CGI script. It uses the code echo QUERY_STRING = $QUERY_STRING to print the value of the query string sent to it. If the query string contains a *, then bash will apply file name expansion to it. This allows an attacker to get the names of any directory on the web server. Clearly, the attacker should not be able to utilize the filename substitution (also called globbing) feature of the shell, and this can be prevented using a policy that prevents the occurrence of
tainted globbing characters such as * in an argument to the internal function shell_glob_filename used by bash for filename substitution.

Program      Language  Workload                                                       Overhead
Apache       C         Webstone 2.5, 2 to 30 clients connected over 100Mbps network   2%
wu-ftpd      C         Download a 12MB file 10 times.                                 15%
tar-1.12     C         Create a tar file of a directory of size 141MB.                19%
bison-1.35   C         Parse a Bison file for C++ grammar.                            24%
WebCalendar  PHP       WebCalendar-1.0.1, download a month's schedule 25 times.       32%
enscript     C         Convert a 5.5MB text file into a postscript file.              34%
bc-1.06      C         Find factorial of                                              %
bash-2.05b   BASH      Loop times, each time computing a simple expression.           70%
gzip         C         Compress a 12 MB file.                                         110%
Avg. Overhead (excludes overhead of Apache)                                           45%

Figure 6: Performance overheads. For Apache server, performance is measured in terms of latency and throughput degradation. For other programs, it is measured in terms of CPU overheads.

5.2 False Positives

False positives can arise due to (a) overly restrictive policies, (b) stronger-than-usual trust, and (c) the conservative nature of taint-tracking. From the discussion of the policies in Section 3, it can be seen that the policies aren't unusually restrictive, and hence we do not expect many false alarms in practice. As expected, we did not experience any false alarms in our experiments.

When an application places a higher degree of trust in its user, the policies need to be refined to reflect this trust. A strength of our approach is the ease of changing policies, so that the operators of applications can quickly remedy false alarms.

The third reason for false alarms is the conservative nature of our taint-tracking. In particular, when arithmetic and bit operations are used to combine tainted values (possibly with untainted values), the result may have very little relationship to the original input. For the type of applications studied in our experiments, this seems to be relatively rare.
These applications tend to interpret user input as data that is either stored, or used as parameters to other applications. In these cases, the tainted values are typically copied, and the use of arithmetic and other operations is rare.

5.3 False Negatives

False negatives can arise due to (a) overly permissive policies, (b) subtle implicit information flows, and (c) use of untransformed libraries without adequate summarization functions. We believe that the policies described in Section 3 are unlikely to permit the types of attacks described in this paper. Implicit information flows can be a problem if the protected application is itself written by the attacker, in which case she can use implicit flows (rather than explicit assignments) to fool taint-tracking. Otherwise, implicit flows typically do not provide a good degree of control over the behavior of the victim application.

As for external libraries, the best approach is to transform them, so that the need for summarization can be eliminated. If this cannot be done, then our transformation will identify all the external functions that are used by an application, so that errors of omission can be avoided.
However, if a summarization function is incorrect, then it can lead to false negatives, false positives, or both.

5.4 Performance

Figure 6 shows the performance overheads of our transformation. The original and the transformed programs were compiled using gcc with -O2 optimizations, and executed on a 1.7GHz/512MB PC running Red Hat Linux 9.0. Execution times were averaged over 10 runs. For Apache, WebStone 2.5 was run for 30 minutes with 2 to 30 clients fetching 0.5KB to 5MB files over a 100Mbps network.

For server programs, the overhead of our approach is low. This is because they are I/O intensive, whereas our transformation adds overheads only to code that performs a significant amount of copying and/or other CPU-intensive operations. For CPU-intensive C-programs, the overhead is moderate, between 24% and 56% for most programs. Interpreters have somewhat higher overheads, possibly because they too are generally CPU intensive. (Note that the bash benchmark is a CPU-intensive micro-benchmark, which tends to exaggerate the overheads.) Finally, gzip has 110% overhead, mainly because of its heavy use of global variables, where our optimizations aren't that effective.

Effect of Optimizations. The optimizations discussed in Section 2.4 have been very effective in reducing the overheads to be reasonably low for most programs. We comment on their effectiveness below:

Use of 2-bit taint tags reduced the overheads by 15% to 75% of the base running time (i.e., the runtime of the untransformed program). Programs that operate on integers gain much more from this optimization than those operating on strings.

Use of mmap reduced the overheads by a further 5% to 50%.

Use of local taint variables reduced the performance overheads by 30% to 400%. For many CPU-intensive programs, more than 90% of their variable accesses are to local variables, with programs such as bc approaching 99%. This is one reason for the effectiveness of this transformation.
The second reason is the fact that this transformation enables other compiler optimizations that have the effect of eliminating tags for many local variables. In contrast, when the tag variables are global, these optimizations aren't possible since they may not be sound (due to possible aliasing etc.).

Intraprocedural analysis and optimization further reduces the overhead by up to 20%. The gains are modest because gcc optimizations have already eliminated most local tag variables after the previous step.

When combined, these optimizations reduce the overhead by a factor of 2 to 5.

6 Related Work

Buffer overflows and related memory errors have received a lot of attention, and several efficient techniques have been developed to address them. Early approaches such as StackGuard [8] and ProPolice [1] focused on just a single class of attacks. Recently, more general techniques based on randomization have been developed, and they promise to capture most memory errors [14, 3]. However, due to the nature of the C-language, these methods still cannot detect certain types of
attacks, e.g., overflow from an array within a structure to an adjacent variable. Fine-grained taint analysis can capture these attacks whenever the corrupted data is used as an argument in a sensitive operation. (This is usually the case, since the goal of an attacker in corrupting that data was to perform a security-sensitive operation.) Although our overheads are generally higher than those of the techniques mentioned above, we believe that they are more than compensated for by the increase in attack coverage.

The idea of using fine-grained taint analysis for detecting buffer overflows and related errors was suggested in [21, 27, 5]. We improve on these works by providing a practical technique that works on commodity hardware and software. In contrast, [27, 5] require processor modifications to support taint-tracking. [21] can work on existing processors (specifically, on x86/linux), but often slows down programs by a factor of 10 or more.

Errors in web applications have received a lot of attention recently. [15] proposes a static analysis to detect potentially dangerous information flows in Java programs. Being a static analysis, its advantage is that it can warn about potential bugs before they are exercised. It is quite accurate in detecting potential flows. However, a frequent problem, as illustrated in the examples in the introduction, is that dependence between untrusted sources and sensitive operations is part of application logic. Without more precise taint analysis, such as what we do in this paper, it isn't possible to distinguish between real vulnerabilities and benign dependencies.

WebSSARI [12] uses a type-inference-based information flow analysis to find potentially dangerous flows in PHP applications. When it finds such a flow, it inserts checking functions that perform comprehensive checks on the safety of these arguments.
Unfortunately, these checks cannot distinguish between the cases where dangerous arguments are used by the application itself, as opposed to those cases where they are the result of an attack. For instance, a web application will need to generate pages containing scripts, so an approach that warns about <script> in the application's output will end up generating many spurious warnings.

SQLrand [4] detects SQL injection attacks by leveraging instruction set randomization techniques to hide the actual encoding of SQL instructions from attackers. This technique could be circumvented if the key used for the randomization were exposed.

In the past few months, [22] and [23] have proposed the idea of using fine-grained taint analysis to detect injection attacks. Their main emphasis was to show the feasibility of the approach, and this was accomplished by manually modifying a PHP interpreter to track taint information for string data, and demonstrating very low overheads. Our technique, developed independently, takes a more general approach: that of an automated program transformation. Moreover, since our approach operates on C-programs, it is applicable to most server programs (which are written in C), as well as many scripting and web application languages such as BASH, PHP and Perl, whose interpreters are written in C. Another important benefit of our approach is its comprehensive treatment of tainting, which ensures accurate taint-tracking for all types of data, as opposed to a manual transformation that is error-prone. Accuracy of tainting is preserved even in the presence of low-level errors such as buffer overflows, which makes it possible to develop a unified approach that can detect a wide range of attacks.

Information flow analysis has been researched for a very long time [2, 10, 9, 16, 28, 19, 18, 25]. Early research was focused on multi-level security, where fine-grained analysis was not deemed necessary [2].
More recent work has been focused on tracking information flow at the level of variables. The latest research efforts, including our work, track information flow at an even finer granularity. Perl has a taint mode [29] that tracks taint information at a coarse granularity, that of variables. Other static taint analysis techniques have been developed to address specific types of programming errors such as user/kernel pointer bugs [13], format string bugs [26], and bugs in
