Computer Science Research Symposium 2008 (Sym 08)


Proceedings of the 1st Annual Computer Science Research Symposium 2008 (Sym 08)
Computer Science Department, Colorado State University
April-May 2008
Edited by Asa Ben-Hur, Yashwant Malaiya, and Indrajit Ray

Message from the Research Committee

The Research Committee is happy to present this Digest of the Computer Science Research Symposium, held for the first time in 2008 in this format. The concept was developed by three committee members to encourage research and the exchange of ideas within the Computer Science Department. Our Call for Papers invited 3-page articles, which were submitted through a conference management system managed by Indrajit Ray. Submission was open to all CS students. We also invited interested faculty members to serve as Program Committee members and provide a light review of the submissions. Six of the papers were selected for 15-minute presentations, scheduled over two sessions of the BMAC (Barney's Monday Afternoon Club). This digest also includes two other papers that were accepted for inclusion. The Digest is available at the BMAC web site as a departmental technical report. We expect that expanded versions of several of the papers will eventually appear in major conferences or journals.

We thank the following PC members for their help with the review process: Sudipto Ghosh, Ross McConnell, Indrakshi Ray, Chuck Anderson, Indrajit Ray, Yashwant Malaiya, and Asa Ben-Hur.

We are happy to note that the first implementation of the concept worked quite well. We hope that this format will again be used for the CS Research Symposium to promote research, interaction, and external publishing activity in the department.

Asa Ben-Hur, Research Committee Member
Yashwant K. Malaiya, Research Committee Chair
Indrajit Ray, Research Committee Member

Generation of Data-Flow Analyses with DFAGen

Andrew Stone (Colorado State), Michelle Strout (Colorado State), Shweta Behere (Avaya)

Index Terms: Program analysis, static analysis, optimization, data-flow analysis, compilers, tools.

Abstract: Data-flow analysis is a commonly used technique to gather program information for use in transformations such as register allocation, dead-code elimination, common sub-expression elimination, scheduling, and others. This paper presents a tool, DFAGen (the Data-Flow Analysis Generator), that allows compiler writers to specify and generate data-flow analyses using a succinct specification language. Other tools that generate data-flow analysis algorithms remove the need for implementers to explicitly write code that iterates over statements in a program, but still require them to implement details regarding the effects of aliasing, side effects, arrays, and user-defined structures. The DFAGen tool generates implementations for locally separable (e.g., bit-vector) data-flow analyses that are pointer, side-effect, and aggregate cognizant from an analysis specification that assumes only scalars. Analysis specifications are typically seven lines long and similar to those in standard compiler textbooks.

I. INTRODUCTION

Program analysis is the process of gathering information about programs to derive a static approximation of their behavior. This information can be used to optimize programs, aid debugging, verify behavior, and detect potential parallelism. Data-flow analysis is a common technique for statically analyzing programs. It works by propagating program information, encoded as data-flow values, through a control flow graph of statements or basic blocks. It is often formalized by a lattice-theoretic framework [3], [1], which specifies the analysis as a transfer function, meet operator, and direction. Popular compiler textbooks such as the Dragon Book [1] specify data-flow analyses with data-flow equations.
For example, reaching definitions analysis can be specified with the equations in Figure 1. Reaching definitions is a flow-sensitive analysis that determines which definitions of a variable may reach specific program points. The results of such an analysis can be used to perform optimizing transformations such as constant propagation, which replaces the use of a variable with a constant value when it is safe to do so. Compiler writers are familiar with data-flow specifications as equations, so our goal is to develop a tool that generates an analysis from such equations.

    in[s]   = ∪_{p ∈ pred[s]} out[p]
    out[s]  = gen[s] ∪ (in[s] − kill[s])
    gen[s]  = {s} if def[s] ≠ ∅
    kill[s] = {t | def[t] ⊆ def[s]}

Figure 1. Example of how reaching definitions analysis is typically presented in a compilers textbook. Note that def is the set of variables assigned at a specific statement, s is a program statement, and pred is the set of statements that immediately precede a given statement.

    int *pointstoone;
    int *pointstotwo;
    int a, b;
    S1  a = ...
    S2  b = ...
    S3  pointstoone = &a;
    S4  if (a < b) {
    S5      pointstotwo = &a;
        } else {
    S6      pointstotwo = &b;
        }
    S7  printf("vals = %d, %d\n", *pointstoone, *pointstotwo);

Figure 2. Example of may and must aliasing issues.

However, a number of issues common in modern languages preclude such a tool from easily being developed. Many data-flow equations are specified in terms of what variables are defined or used at a given statement. Due to the effects of aliasing, analysis writers must consider whether data-flow equations require the set of variables that may be defined or must be defined at that statement. Figure 2 shows an example of such behavior: in statement S7, it can be determined that *pointstoone must reference the value assigned to a; however, *pointstotwo may use either a or b. In existing tools it is necessary to specify a unique transfer function for each statement type in the intermediate representation.
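The data-flow equations above lend themselves to a standard iterative fixed-point computation. As an illustration (not DFAGen output; statement names follow Figure 2, and aliasing is ignored so each statement defines one known variable), a minimal Python sketch of reaching definitions:

```python
# Iterative reaching-definitions solver over a small CFG (a sketch, not
# DFAGen-generated code). Statements S1..S7 follow Figure 2; 'defs' maps
# each defining statement to the variable it assigns, ignoring aliasing.
defs = {"S1": "a", "S2": "b", "S3": "p1", "S5": "p2", "S6": "p2"}
preds = {"S1": [], "S2": ["S1"], "S3": ["S2"], "S4": ["S3"],
         "S5": ["S4"], "S6": ["S4"], "S7": ["S5", "S6"]}
stmts = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]

def reaching_definitions(stmts, preds, defs):
    out = {s: set() for s in stmts}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for s in stmts:
            # in[s] = union of out[p] over predecessors p
            in_s = set().union(*(out[p] for p in preds[s]))
            gen = {s} if s in defs else set()
            # kill: other definitions of the same variable
            kill = ({t for t in defs if t != s and defs[t] == defs[s]}
                    if s in defs else set())
            new_out = gen | (in_s - kill)
            if new_out != out[s]:
                out[s], changed = new_out, True
    return out

out = reaching_definitions(stmts, preds, defs)
# Both assignments to p2 (S5 and S6) reach S7, since the join at S7
# merges the two branches.
print(sorted(out["S7"]))   # ['S1', 'S2', 'S3', 'S5', 'S6']
```

At S5 the definition from S6 is killed (same variable), and vice versa, which is exactly the may/must distinction that becomes subtle once *pointstotwo can alias either variable.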
Most programming languages today consist of structures that complicate the process of data-flow analysis. Such structures include arrays, objects,

pointers, and function calls.

II. DFAGEN SPECIFICATION LANGUAGE

DFAGen specifies data-flow analyses as GEN and KILL data-flow equations plus a series of properties. A class of data-flow analyses, those which are locally separable, is expressible in this format. Such definitions are similar to those in compiler textbooks, as can be seen by comparing Figures 1 and 3.

    Analysis: ReachingDefinitions
    meet: union
    flowvalue: stmt
    direction: forward
    style: may
    gen[s]: {s | defs[s] != empty}
    kill[s]: {t | defs[t] subset defs[s]}

Figure 3. DFAGen specification for reaching definitions.

    Def      → Analysis : id
               meet : (union | intersection)
               flowtype : id
               direction : (forward | backward)
               style : (may | must)
               gen[ id ] : Set
               kill[ id ] : Set
    Set      → id[id] | BuildSet | emptyset
    Expr     → Expr Op Expr | Set
    Op       → union | intersection | difference
    Cond     → Expr CondOp Cond
    CondOp   → and | or | subset | superset | equal | not-equal | proper-subset | proper-superset
    BuildSet → {id : Cond}

Figure 4. Grammar for analysis, GEN, and KILL set definition.

The GEN and KILL sets are specified in mathematical set notation. GEN and KILL equations can operate on a set of incoming data-flow values (those values computed previously by the analysis) and on predefined set structures. Predefined sets are sets of values mapped to statements; they are computed prior to performing the analysis. DFAGen includes predefined sets for the set of variables defined, the set of variables used, and the set of expressions within a program statement. DFAGen also provides a mechanism for compiler writers to implement their own predefined sets with C++ code.

III. DESIGN OF DFAGEN

DFAGen's process of reading a specification file, analyzing its data-flow equations, and emitting data-flow analysis implementation code is broken into five phases.
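A specification in the Figure 3 style is a short list of key/value properties plus two set expressions, so even a toy reader is only a few lines. The following Python sketch is purely illustrative (DFAGen's actual parser builds ASTs for the GEN/KILL expressions rather than keeping them as strings):

```python
# Toy reader for a Figure-3-style specification: splits each line at the
# first ':' into a property name and its value. Illustrative only.
def parse_spec(text):
    spec = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        spec[key.strip()] = value.strip()
    return spec

spec = parse_spec("""
Analysis: ReachingDefinitions
meet: union
flowvalue: stmt
direction: forward
style: may
gen[s]: {s | defs[s] != empty}
kill[s]: {t | defs[t] subset defs[s]}
""")
print(spec["meet"], spec["style"])   # union may
```

Note how the whole analysis fits in seven properties; everything else (statement iteration, bit-vector plumbing, may/must handling) is the generator's job.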
The first phase parses a DFAGen specification file, encodes the information it contains, and constructs abstract syntax trees (ASTs) for the GEN and KILL equations. The second phase runs type inference, assigning type information to each node in the GEN/KILL equation ASTs. The third phase analyzes the type information to determine whether the specification leads to a legal analysis. The fourth phase is may/must inference. At each predefined-set reference in a data-flow equation it is necessary to determine whether it refers to the may or must variant of the set. It is possible to derive this information from the style of the analysis and the context of the reference. The style of the analysis is either may or must, depending on whether the analysis results in the set of data-flow values that may be true at a program statement or the set of such values that must be true. This phase is a major contribution of the DFAGen project; such analysis has not previously been incorporated into analysis generation tools. May/must inference works by analyzing ASTs in a top-down fashion, tagging nodes as either upper or lower bound depending on the parent node's operation and tag. How a predefined-set reference is tagged determines whether it should be the may or must variant. The fifth phase generates analysis code. The code generator emits C++ source files that can be linked against the OpenAnalysis framework [5]. OpenAnalysis is a framework for implementing compiler analyses, and a collection of such analyses, written in a manner that separates analysis from the details of the intermediate program representation.

IV. EVALUATION AND CURRENT STATUS

Table I shows the number of source lines of C++ code used to implement liveness and reaching definitions analysis in both hand-written and DFAGen-generated implementations. The Specification LOC column shows the number of lines of code in the DFAGen specification file for the analysis.
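The top-down may/must tagging described above can be sketched as follows; the AST encoding and the names `tag`, `bound`, and `kill_defs` are hypothetical, not DFAGen's actual data structures:

```python
# Sketch of DFAGen-style may/must inference over a tiny AST encoding.
# Nodes are tuples: ("union"|"intersection"|"difference", left, right)
# or ("set", name). Top-down tagging: the root of a 'may' analysis is an
# upper bound; the right operand of a set difference flips the bound,
# because subtracting too little keeps the result an over-approximation.
def tag(node, bound, out):
    op = node[0]
    if op == "set":
        out[node[1]] = "may" if bound == "upper" else "must"
    elif op == "difference":
        tag(node[1], bound, out)
        flipped = "lower" if bound == "upper" else "upper"
        tag(node[2], flipped, out)   # subtrahend gets the opposite bound
    else:                            # union / intersection: bound unchanged
        tag(node[1], bound, out)
        tag(node[2], bound, out)
    return out

# In a 'may' analysis, the set subtracted inside in[s] - kill[s] must be
# the 'must' variant of its predefined sets:
ast = ("difference", ("set", "in"), ("set", "kill_defs"))
print(tag(ast, "upper", {}))   # {'in': 'may', 'kill_defs': 'must'}
```

This is the intuition behind the phase: each predefined-set leaf ends up tagged, and the tag selects which pointer-aware variant the generated C++ code queries.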
A compiler writer developing a data-flow analysis defined in terms of variable use and definition would only need to write these seven lines. Predefined set LOC refers to how many lines of C++ code are used to specify the def and use predefined set structures. Since many analyses will use only the structures included with DFAGen, and user-defined structures can be shared

across multiple analyses, we believe predefined-set LOC will not play a large role in most analysis specifications. Since predefined set code is copied into DFAGen-generated code, the manually written version of the analysis will include similar code.

Table I. Lines of code in manual and DFAGen-generated analyses (columns: Analysis, Manual LOC, Automatic LOC, Specification LOC, Predefined set LOC; rows: Liveness, Reaching Definitions).

Table II. Evaluations with SPEC C benchmarks (columns: Benchmark, Benchmark SLOC, Liveness manual time, Liveness automatic time, Reaching defs manual time, Reaching defs automatic time; rows: 470.lbm, mcf, libquantum, bzip2, sjeng, hmmer).

Table II shows the time to execute the manual and DFAGen-generated analyses on a number of the SPEC C benchmarks. Currently, generated analyses take longer, but we believe that a number of simple optimizations, particularly to the generation of predefined sets, would lead to comparable times to solution for hand-written and generated analyses.

V. RELATED WORK

A number of tools, such as Sharlit [6], PAG [4], [2], and the specification language AG [7], allow compiler writers to generate data-flow analyses. However, none of these tools directly addresses the may/must issues of aliasing, and all of them specify analyses in an imperative style. We believe DFAGen's declarative style of specification is more intuitive. However, such specification does restrict DFAGen to the class of locally separable analyses.

VI. CONCLUSIONS

Implementing data-flow analysis, even within the context of a data-flow analysis generator, is complicated by the need to handle may and must pointer, side-effect, and aggregate information. DFAGen is an analysis generator tool that, given succinct specifications, can generate all necessary implementation details.
The presented tool depends on the availability of code for generating may and must versions of sets such as the definition and use sets, but we present techniques that enable the tool to infer when the may and must versions of the predefined sets are needed. Future work includes extending DFAGen to non-separable data-flow analyses, producing analyses that run faster, and including a tuple data type within the specification language.

ACKNOWLEDGMENT

This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under award #ER. We would like to thank Paul Hovland and Amer Diwan for their comments and suggestions with regard to this paper.

REFERENCES

[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools, second edition. Pearson Addison Wesley.
[2] M. Alt and F. Martin. Generation of efficient interprocedural analyzers with PAG. In Static Analysis Symposium, pages 33-50.
[3] G. A. Kildall. A unified approach to global program optimization. In ACM Symposium on Principles of Programming Languages, October.
[4] F. Martin. PAG: An efficient program analyzer generator. International Journal on Software Tools for Technology Transfer, 2(1):46-67.
[5] M. M. Strout, J. Mellor-Crummey, and P. Hovland. Representation-independent program analysis. In Proceedings of the Sixth ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE), September.
[6] S. W. Tjiang and J. L. Hennessy. Sharlit: A tool for building optimizers. In The ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
[7] J. Zeng, C. Mitchell, and S. A. Edwards. A domain-specific language for generating dataflow analyzers. Electronic Notes in Theoretical Computer Science, 164(2), 2006.

Seasonality in Vulnerability Discovery in Windows Operating Systems

HyunChul Joh and Yashwant K. Malaiya

Abstract: Being able to estimate vulnerability discovery rates allows developers to plan resource allocation for patch development after releasing software. Recently, quantitative vulnerability discovery models have been proposed that use the calendar time after release as the controlling factor. Some other models use the installed base to estimate the vulnerability finding effort. While some of the vulnerability discovery models fit the data well for several real data sets, they have not examined some of the specific factors that impact vulnerability discovery. This study examines whether vulnerability discovery rates exhibit an annual seasonal pattern, using the seasonal index and autocorrelation function approaches. A time series analysis that can combine longer-term trends with cycles caused by seasonality may predict the future pattern more accurately. Here, the data sets for four major Windows operating systems, obtained from the National Vulnerability Database, are analyzed. The analysis shows that there is indeed an annual seasonal pattern, with higher incidence during the middle of the winter and summer seasons.

Table I. Vulnerability discovery seasonal indexes by month (Jan-Dec) and chi-square p-values for Win NT, Win XP, Win 2000, and Win 2003.

Keywords: Security vulnerabilities, time series analysis, seasonality, vulnerability discovery models.

I. INTRODUCTION

Operating systems form the complex foundation for computing systems. A large number of vulnerabilities are discovered in OSs every year, and these represent a major security risk. If we can predict the vulnerability discovery rate and the attributes of the vulnerabilities discovered, we can allocate the needed resources at the right time for corrective measures, which can greatly reduce the security risks.
While several time-based and effort-based vulnerability discovery models (VDMs) have recently been proposed [1], researchers have not examined the potential seasonal effect in the software vulnerability discovery process. In this paper, we analyze the vulnerability data sets of four Windows OSs to identify possible seasonal patterns using two approaches: the seasonal index with a chi-square test, and the autocorrelation function (ACF).

Both the time-based and effort-based VDMs postulate that a higher installed base for a specific software product attracts increased vulnerability finding effort. However, some of the deviation of the actual data from the models is not easily explained. Kim [2] has shown that shared code between two successive releases can impact the discovery rate. A visual examination of the data suggests a seasonal pattern. This study addresses the question: is there indeed a seasonal pattern that is statistically significant?

HyunChul Joh is a graduate student in the Department of Computer Science, Colorado State University, Fort Collins, CO USA (dean2026@cs.colostate.edu). Yashwant K. Malaiya is a professor in the Department of Computer Science, Colorado State University, Fort Collins, CO USA (malaiya@cs.colostate.edu).

Fig 1. Seasonal indexes for the four OSs (Table I)
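The two checks used in this study can be illustrated with a short sketch. The monthly counts below are made up for illustration (the study itself uses NVD data); the seasonal index is each month's total over the grand monthly average, and the chi-square statistic compares the counts against a uniform expectation:

```python
# Seasonal index and chi-square check on hypothetical monthly counts
# (illustrative only; the study uses NVD data for four Windows OSs).
def seasonal_indexes(counts):
    """Ratio of each month's total to the grand monthly average."""
    mean = sum(counts) / len(counts)
    return [c / mean for c in counts]

def chi_square_stat(counts):
    """Chi-square statistic against a uniform expectation (11 d.o.f.)."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Hypothetical counts, Jan..Dec, with mid-summer and year-end peaks:
counts = [14, 8, 9, 7, 8, 18, 20, 10, 8, 9, 10, 23]
print([round(i, 2) for i in seasonal_indexes(counts)])
print(round(chi_square_stat(counts), 1))
```

With these hypothetical counts the statistic is 27.0, above the 5% critical value of roughly 19.7 for 11 degrees of freedom, so uniformity would be rejected, mirroring the small p-values reported in Table I.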

Fig 2. Vulnerabilities discovered each month along the calendar time line for Windows OSs

II. VULNERABILITY DATA AND ANALYSIS

A. Source of the data sets

The vulnerability data used here can be accessed at the National Vulnerability Database (NVD), which is available for public use. NVD is the U.S. government repository of vulnerability management data, collected and organized using specific standards. This data is intended to permit automation of vulnerability management, security measurement, and compliance [3]. Here, we examine four major Windows OSs: Windows NT, Windows XP, Windows 2000, and Windows 2003. Figure 2 shows the number of vulnerabilities found along the time line for these OSs. The middle of summer (mid-year) and the winter season (year-end) appear to have most of the peaks, suggesting the possibility of seasonality in the discovery process. We examine this possibility systematically.

B. Seasonal index & chi-square test

Seasonal patterns are present in a time series when certain months consistently have more incidences of vulnerabilities reported than other months. Table I shows seasonal indexes for each month for the four OSs. A seasonal index shows how much the average for a particular period tends to be above or below the grand average [4]. Figure 1 shows that the seasonal index values for the middle of winter and summer tend to be higher, significantly above one, which is the expected value. To evaluate the significance of the non-uniformity of the distribution of the data sets, we conducted a chi-square test of the grand total for each month against the expected value (total vulnerabilities divided by 12), where the null hypothesis is that there is no seasonality in the data set. In Table I, we can see that the four OSs yield extremely small p-values, so we have strong evidence of non-uniform distributions of vulnerability discovery rates [4].

C.
Autocorrelation function analysis

The autocorrelation function (ACF) in time series analysis is calculated by computing the correlation between a variable's value and successive values of the same variable after some time lag. In other words, the ACF measures the linear relationship between time series observations separated by a lag of k time units [5, 6]. Hence, when an ACF value lies outside the defined confidence intervals (CI) at a lag t, it can be concluded that a relationship recurs every t time units along the time line. Figure 3 shows the ACFs of the four OSs. The upper and lower horizontal dotted lines in each graph represent 95% confidence intervals. Since the summer and winter seasons have the majority of the big peaks in Figure 2, we expect that lags corresponding to six months or its multiples would have their ACF values outside the CI. In figure 3(a), lags of 5, 6, 11, 24, and 35 months are

Fig 3. Autocorrelation function for the four Windows OSs (the lag is in months)

outside the CI; in other words, at every 5, 6, 11, 24, and 35 months there are strong autocorrelations, which would confirm a seasonal pattern. In figure 3(b), lags of 5, 6, and 18 months; in figure 3(c), lags of 5 and 18 months; and in figure 3(d), lags of 5, 6, and 35 months are significantly different from zero in the ACF, which confirms seasonal patterns in Windows OS vulnerability discoveries. The same approach has been applied in [6, 7, 8] to establish seasonality in data sets belonging to other fields of research.

III. CONCLUSIONS

We have demonstrated that there is a seasonal pattern for Windows OSs, using the seasonal index and the autocorrelation function. The results show annual seasonal patterns with higher incidence during the middle of the winter and summer seasons. Further research is needed to identify why security vulnerabilities tend to peak in the middle of the summer and winter seasons. One possibility is that some major computer security conferences take place in summer and winter, so potential conference participants might have a higher incentive [9] to find vulnerabilities to brag about. Also, Rescorla [10] mentions that a large number of vulnerabilities are reported at the end of the year as an artifact of end-of-year cleanup. Future work includes predicting the future vulnerability discovery trend using the Box-Jenkins (ARIMA) model, which uses ACF analysis, and applying the result to the AML model [1] to improve vulnerability discovery predictions.

REFERENCES

[1] O. H. Alhazmi and Y. K. Malaiya, "Application of Vulnerability Discovery Models to Major Operating Systems," IEEE Trans. Reliability, March 2008.
[2] J. Kim, Y. K. Malaiya and I. Ray, "Vulnerability Discovery in Multi-Version Software Systems," Proc. 10th IEEE Int. Symp. on High Assurance System Engineering (HASE), Dallas, Nov.
2007.
[3] National Institute of Standards and Technology. National Vulnerability Database. [Online]. Accessed March 31, 2008.
[4] Hossein Arsham. Time-Critical Decision Making for Business Administration. [Online]. Accessed March 31, 2008.
[5] Bruce L. Bowerman and Richard T. O'Connell, Time Series Forecasting, 2nd edition. Boston: Duxbury Press, 1987, p. 31.
[6] M. Rios, J. M. Garcia, J. A. Sanchez, and D. Perez, "A Statistical Analysis of the Seasonality in Pulmonary Tuberculosis," European Journal of Epidemiology, Vol. 16, No. 5, May 2000.
[7] Nancy Tran and Daniel A. Reed, "Automatic ARIMA Time Series Modeling for Adaptive I/O Prefetching," IEEE Transactions on Parallel and Distributed Systems, Vol. 15, No. 2, February 2004.
[8] Anne Senter. A Summary of Forecasting Methods. [Online]. Available: eseries1.htm. Accessed March 31, 2008.
[9] A. Arora and R. Telang, "Economics of Software Vulnerability Disclosure," IEEE Security and Privacy, Jan. 2005.
[10] E. Rescorla, "Is Finding Security Holes a Good Idea?" IEEE Security & Privacy, Vol. 3, No. 1, Jan.-Feb. 2005.

Use of a New Trust Model for Making Reasoned Decisions in Different Security Contexts

Sudip Chakraborty and Indrajit Ray
Computer Science Department, Colorado State University
{sudip, indrajit}@cs.colostate.edu

Abstract: Security services rely to a great extent on some notion of trust. However, there is no accepted formalism or technique for the specification of trust and for reasoning about trust. In this paper we present an overview of a new trust model [1] and discuss how this model helps to make reasoned decisions in different security contexts, for example, in access control for open and distributed systems [2], or in finding a trusted path to deliver data from a source to a destination in an ad hoc network [3].

I. INTRODUCTION

Conventional security mechanisms based on cryptographic techniques and credentials provide a notion of hard security, where we assume that nothing can go wrong or that systems will behave exactly as expected. However, in the presence of malicious entities or entities with unknown identity, it is difficult to guarantee this assumption. Therefore, we need an alternative soft approach that can reason about the level of uncertainty involved in different security contexts. The notion of trust provides such reasoning: the level of trustworthiness of an entity indicates the level of assurance that the entity will behave according to our expectations. For this reason, there is a need to incorporate trust into current security services so that security decisions are guided by reasoning about the trustworthiness of the entities involved. For this purpose, several trust models [4], [5], [6], [7] have been proposed, which differ from each other in the semantics, representation, and evaluation of trust. This shows the lack of agreement on the specification of trust and on reasoning about trust. In this paper we give an overview of a trust model, proposed in [1], which is more generic and flexible than earlier models in reasoning about trust.
A trust relationship in the model is represented as a tuple whose components are parameters that influence trust. The tuple also has an associated numeric value in [-1, 1] ∪ {⊥} to represent different degrees of trust. We discuss how this model can help to make more reasoned and fine-granular decisions about trust in different security paradigms, for example, in access control for open and distributed systems where the user population is dynamic and the identities of all users are not known in advance [2], or in finding a trusted path to deliver data from a source to a destination in an ad hoc network [3].

II. OVERVIEW OF THE TRUST MODEL

Here we present an overview of the trust model proposed in [1]. In this model we specify trust in the form of a trust relationship between two entities: a truster A and a trustee B. The relationship is a three-element tuple whose components are experience, properties, and recommendation, the factors influencing trust. This is represented as (A → B) = (AE_B, AP_B, AR_B), where AE_B represents the magnitude of A's interaction with B, AP_B represents the measure of B's attributes as evaluated by A, and AR_B represents the collective effect of recommendations about B from other entities. Each of the parameters is evaluated within the numeric range [-1, 1] ∪ {⊥}, where a positive value of a parameter contributes to an increase in trust, a negative value contributes to increasing distrust, and 0 contributes neither way. When we do not have sufficient information to evaluate a parameter, we use the value ⊥. The relative importance (weight) of the parameters is assigned by defining a tuple (W_E, W_P, W_R), where each W_i ∈ [0, 1] and W_E + W_P + W_R = 1, and then taking a component-wise multiplication to derive normalized values of the parameters. We also associate a single value, called the trust value, with the trust tuple, indicating the trust level that A has on B.
This value is a function of the normalized values of AE_B, AP_B, and AR_B. A positive value indicates trust, a negative value indicates distrust, 0 is the neutral position where there is neither trust nor distrust, and ⊥ indicates the unknown level, i.e., when

there is not enough information to decide about trust, distrust, or neutrality from the parameters. In the following sections we discuss how the above model can be used in different security contexts.

III. ACCESS CONTROL DECISIONS IN OPEN SYSTEMS

Conventional access control models such as role-based access control (RBAC) [8] are suitable for regulating access to resources by known users. These models have often been found inadequate for open and decentralized multi-centric systems, where the user population is dynamic and the identities of all users are not known in advance. In such systems, it is difficult to assign and regulate appropriate roles for this large population of users, which may contain malicious users.

A. Outline of the Approach

Here we outline the approach, proposed in [2], for using the trust model to make access control decisions in the above scenario. The idea is to extend the conventional RBAC model with the notion of trust. In our approach, we borrow the idea of role-permission binding from RBAC. Permissions are sets of actions defined on sets of objects. Object-action bindings and role-permission bindings are done according to the system's policies. Now, instead of assigning roles to individual users, roles are tied to different trust levels, according to the access control policies of the system. A user, by attaining a certain trust level, is able to activate the roles associated with that trust level. The user's trust is evaluated, using the proposed model, based on a number of factors such as the user's credentials (to evaluate properties), the user's behavior history (to evaluate experience), and recommendations about the user (to evaluate recommendation). This trust changes as the user's behavior, properties, or recommendations change, and the user's access privileges change automatically with it.
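The trust-level-to-role mapping can be sketched as follows. The weights, thresholds, and role names are illustrative assumptions, not values from [1] or [2]; the sketch only shows how a weighted (experience, properties, recommendation) tuple can gate role activation:

```python
# Sketch of trust-based role activation (hypothetical thresholds and
# weights; the model in [1] defines the trust tuple more carefully).
UNKNOWN = None   # stands in for the model's 'unknown' trust value

def trust_value(e, p, r, weights=(0.4, 0.3, 0.3)):
    """Weighted combination of experience, properties, and
    recommendation, each in [-1, 1] or UNKNOWN."""
    parts = [(w, v) for w, v in zip(weights, (e, p, r)) if v is not UNKNOWN]
    if not parts:
        return UNKNOWN
    return sum(w * v for w, v in parts)

# Roles bound to trust levels, per the system's access control policy:
ROLE_THRESHOLDS = [(0.7, "power_user"), (0.3, "member"), (0.0, "guest")]

def roles_for(value):
    if value is UNKNOWN or value < 0:
        return []                     # distrust or unknown: no roles
    return [role for t, role in ROLE_THRESHOLDS if value >= t]

v = trust_value(0.8, 0.6, 0.5)   # 0.32 + 0.18 + 0.15 = 0.65
print(roles_for(v))              # ['member', 'guest']
```

As the user's behavior history improves, `trust_value` rises and additional roles activate automatically; negative recommendations push the value down and privileges are downgraded without any explicit role reassignment.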
For example, a new user, upon producing certain credentials, will attain a certain (possibly very low) trust level, and the system will allow the user some basic role. Producing more credentials in the future, or behaving well over time, will increase the user's trust level, and the system will then allow more advanced roles. However, malicious activities performed by the user, or negative recommendations, will lower the user's trust level, and the user's access privileges will be automatically downgraded. Note that, to use the model effectively for assigning and regulating access privileges, attention must be given to how roles are assigned to trust levels. That is, the access control policies must be designed so that higher privileges are associated with higher trust levels.

IV. FINDING A TRUSTED PATH TO SEND DATA IN AN AD HOC NETWORK

The requirement of a reliable transmission path imposes significant challenges in ad hoc environments. An ad hoc network can seldom assume a reliable network infrastructure for communication, and mobility of nodes is frequently considered an asset. In addition, such a network may involve devices with low capabilities (in terms of computation, storage, and power), for which the use of strong cryptographic techniques is not feasible. Moreover, in hostile environments these nodes are easily compromised. Under such circumstances it benefits an ad hoc network if a path provides the best opportunity for reliable delivery of messages. We have formulated this problem as a routing problem in an ad hoc network [3]. Routing protocols for ad hoc environments using different metrics, such as signal stability [9] or forwarding behavior [10], have been proposed. Next we outline our proposal, presented in [3], for trust-based routing where the trustworthiness of nodes is evaluated using the trust model.

A. Outline of the Approach

We assume that each node N_r in the network has a trust relationship with each neighbor N_e (that is, a node at 1-hop distance).
A node periodically sends a beacon message (akin to an "I am alive" message, carrying information necessary to prove the node's existence) to its neighbors. A node computes the experience component from the forwarding behavior (i.e., the number of packets sent and the number of packets dropped) of its neighbor. The properties component of a neighbor is based on the signal strength received by the node: whenever the node receives a beacon message from the neighbor, the extended device driver interface of the receiving node measures, using a receive signal strength indicator (RSSI) unit(1), the signal strength at which the beacon was received. We also assume that other neighbors, upon request, agree to provide recommendations about a specific neighbor. Using this information, the trust value v(N_r → N_e) is evaluated. We then convert this trust value to a cost on the link (N_r, N_e). The two are related as follows: the higher the trust, the lower the cost, and the cost increases as distrust increases. The rationale is that the cost (in terms of integrity violations and other malicious activities) of forwarding a packet

(1) RSSI is the IEEE standard for measuring radio frequency energy sent by the circuitry on a wireless network interface card.

through a more trustworthy node is less than that through a less trustworthy node. The cost is minimum (though not zero) when N_r has absolute trust in N_e. This minimum cost (Min_cost) is a small positive cost incurred due to forwarding overhead; it is uniform over the whole network and set at the bootstrapping of the system. We assume that the cost decays logarithmically with increasing trustworthiness, subject to the conditions: at v(N_r → N_e) = 1, cost = Min_cost, and as v(N_r → N_e) → −1, cost → ∞. The function is defined as

cost(N_r, N_e) = Min_cost − ln((1 + v(N_r → N_e)) / 2)    (1)

The maximum allowable cost for N_r is incurred when v(N_r → N_e) = 0, i.e., when N_r is neutral about the trustworthiness of N_e. The path having the least average cost from the source node to the destination node is considered the most reliable among the available paths and is chosen by the source node to forward the data. We assume that if N_r has a distrust value (i.e., a value less than 0) for N_e, then N_e is discarded as a next hop. Figure 1(a) pictorially describes the main idea of our protocol.

Fig. 1. Trust relations between nodes and the corresponding costs on the links: (a) trust relations among neighbors, (b) forwarding cost on the links, (c) least average cost path.

All this information (next hop, average cost, number of hops, etc.) is stored in the routing table of a node. When a node receives a route discovery request from a source, it checks its routing table. If an unexpired route to the destination is present, it sends the next-hop and cost-related information to the source. The source then evaluates the cost of the link between the neighbor and itself and, using the hop count, computes the average cost of forwarding the packet to the destination. The source may get multiple such responses; in that case, it chooses the next hop for which the average cost over the path is minimum.
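Equation (1) and the least-average-cost path selection can be sketched as follows; the `MIN_COST` value and the `best_path` helper are illustrative choices, not values fixed by the protocol:

```python
import math

MIN_COST = 0.1  # Min_cost: uniform forwarding overhead set at bootstrap
                # (the concrete value here is chosen only for illustration)

def link_cost(trust):
    """Cost of link (N_r, N_e) from trust value v, per Eq. (1):
    cost = Min_cost - ln((1 + v) / 2).
    Distrusted neighbors (v < 0) are discarded as next hops."""
    if trust < 0:
        return None  # neighbor discarded
    return MIN_COST - math.log((1.0 + trust) / 2.0)

def best_path(paths):
    """Pick the next hop whose path has the least average link cost
    (hypothetical helper; `paths` maps next hop -> list of link costs)."""
    return min(paths, key=lambda hop: sum(paths[hop]) / len(paths[hop]))
```

Note the boundary behavior: `link_cost(1.0)` returns exactly `MIN_COST` (absolute trust), while `link_cost(0.0)` returns `MIN_COST + ln 2`, the maximum allowable cost at neutrality.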
If the node that receives a route discovery request from the source does not itself have next-hop information for that destination, it initiates a route discovery process as a source.

V. CONCLUSION

We often make trust-based decisions in different security contexts. However, to use trustworthiness to make reasoned decisions in different contexts, we need a generic and flexible model of trust. In this paper, we present a new model of trust that incorporates different independent factors to evaluate trust. It defines trust as a quantitatively measurable entity with a potentially infinite number of levels. We discuss how this model can be used to assign and regulate access privileges for a large and dynamic population of users in an open system. The advantage of the model for regulating access privileges is that we can have as many trust levels as we want and can bind each level to a different role-permission binding; this way we can achieve more fine-grained access control. We also discuss how the model helps find the most trusted path to send data in an ad hoc network. The advantage of determining the path using trust, rather than by hop count (shortest path), is that we gain more assurance about delivery of the data to the destination.

REFERENCES

[1] I. Ray and S. Chakraborty, "A Vector Model of Trust for Developing Trustworthy Systems," in Proceedings of the 9th European Symposium on Research in Computer Security (ESORICS 2004), Sophia Antipolis, France, September 2004.
[2] S. Chakraborty and I. Ray, "TrustBAC: Integrating Trust Relationships into the RBAC Model for Access Control in Open Systems," in Proceedings of the 11th ACM Symposium on Access Control Models and Technologies (SACMAT 2006), CA, USA, June 2006.
[3] S. Chakraborty, N. Poolsappasit, and I.
Ray, "Reliable Delivery of Event Data from Sensors to Actuators in Pervasive Computing Environments," in Proceedings of the 21st Annual IFIP WG 11.3 Working Conference on Data and Applications Security (DBSec 2007), CA, USA, July 2007.
[4] R. Yahalom, B. Klein, and T. Beth, "Trust Relationships in Secure Systems: A Distributed Authentication Perspective," in Proceedings of the 1993 IEEE Computer Society Symposium on Security and Privacy, CA, USA, May 1993.
[5] C. Jonker and J. Treur, "Formal Analysis of Models for the Dynamics of Trust Based on Experience," in Proceedings of the 9th European Workshop on Modelling Autonomous Agents in a Multi-Agent World, Berlin, July.
[6] A. Abdul-Rahman and S. Hailes, "Supporting Trust in Virtual Communities," in Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, Hawaii, USA, January 2000.

[7] L. Li and L. Liu, "A Reputation-Based Trust Model for Peer-to-Peer eCommerce Communities," in Proceedings of the IEEE Conference on E-Commerce (CEC 2003), CA, USA, June 2003.
[8] D. Ferraiolo, R. Sandhu, S. Gavrila, R. Kuhn, and R. Chandramouli, "Proposed NIST Standard for Role-Based Access Control," ACM Transactions on Information and System Security, vol. 4, no. 3, August 2001.
[9] R. Dube, C. D. Rais, K.-Y. Wang, and S. K. Tripathi, "Signal Stability-Based Adaptive Routing (SSA) for Ad Hoc Mobile Networks," IEEE Personal Communications Magazine, vol. 4, no. 1, February 1997.
[10] C. Zouridaki, B. L. Mark, M. Hejmo, and R. K. Thomas, "A Quantitative Trust Establishment Framework for Reliable Data Packet Delivery in MANETs," in Proceedings of the 3rd ACM Workshop on Security of Ad Hoc and Sensor Networks (SASN 2005), VA, USA, November 2005.

A Structured-Outputs Method for Prediction of Protein Function

Artem Sokolov and Asa Ben-Hur
Colorado State University, Fort Collins, Colorado

Abstract
We apply the structured-output methodology to the problem of predicting the molecular function of proteins. Our results demonstrate that learning the structure of the output space yields better performance than the traditional transfer-of-annotation method.

1. Introduction
We address the problem of automatic annotation of protein function using structured-output methods. The function of a protein is defined by a set of keywords that specify its molecular function, its role in biological processes, and its localization to a cellular component. The Gene Ontology (GO) imposes a hierarchy over the keywords and is considered the current standard for annotating gene products and proteins (Gene Ontology Consortium, 2000). Computational methods for annotating protein function have predominantly followed the transfer-of-annotation paradigm, where GO keywords are transferred from one protein to another based on the sequence similarity between the two. This is generally done by employing a sequence alignment tool such as BLAST (Altschul et al., 1990) to find annotated proteins that have a high level of sequence similarity to the un-annotated query protein. Such variations on the nearest-neighbor methodology suffer from serious limitations in that they fail to exploit the inherent structure of the annotation space. Furthermore, transfer of multiple GO keywords between proteins is not always appropriate, e.g., in the case of multi-domain proteins (Galperin & Koonin, 1998). Since proteins can have multiple functions, and those functions are described by a hierarchy of keywords, we formulate prediction of protein function as a hierarchical multi-label classification problem and apply structured-output prediction methods to it.
This work focuses on the structured perceptron, which we use as an alternative to the BLAST nearest-neighbor methodology. Empirical results demonstrate that learning the structure of the output space yields improved performance over transfer of annotation. In our experiments we use BLAST to define the input-space features as well as to limit the output space during inference. We demonstrate that failure to limit the output space can be detrimental to prediction accuracy. In future work we will explore the use of more sophisticated methods of structured-output prediction, such as maximum-margin classifiers (Tsochantaridis et al., 2005; Rousu et al., 2006).

2. Methods
Prediction of protein function can be formulated as a hierarchical multi-label classification problem as follows. Each protein is annotated with a macro-label y = (y_1, y_2, ..., y_k) ∈ {0, 1}^k, where each micro-label y_i corresponds to one of the k nodes that belong to the hierarchy defined by the Gene Ontology. A micro-label takes on the value 1 when the protein performs the function defined by the corresponding node. Whenever a protein is associated with a particular micro-label, we also associate it with all its ancestors in the hierarchy, i.e., given a specific term, we associate with the protein all terms that generalize it. Note that the Gene Ontology consists of three distinct hierarchies: molecular function, biological process, and cellular component. In this work we focus on the molecular function hierarchy. We train a linear classifier to predict the molecular function of proteins. Given a protein characterized by x in the input feature space X, we infer the most likely label according to:

ŷ = h(x) = argmax_{y ∈ Y} f(x, y; w)

where Y is the set of possible macro-labels we are willing to consider. The function f(x, y; w) : X × Y → R can be thought of as a compatibility measure between an input x and an output macro-label y.
We assume the function is linear in w, i.e., f(x, y; w) = wᵀφ(x, y) in some space defined by the mapping φ. We train the classifier using a variant of the perceptron algorithm generalized for structured outputs (Collins, 2002).

Algorithm 1: Perceptron for Structured Outputs
  Input: training data {(x_i, y_i)}, i = 1, ..., n
  Output: parameters α_{i,y} for i = 1, ..., n and y ∈ Y
  Initialize: α_{i,y} = 0 for all i, y
  repeat
    for i = 1 to n do
      Compute the top two scoring labels:
        ŷ ← argmax_{y ∈ Y} f(x_i, y; α)
        ȳ ← argmax_{y ∈ Y∖{ŷ}} f(x_i, y; α)
      if ŷ ≠ y_i then
        Handle misclassification:
          α_{i,y_i} ← α_{i,y_i} + 1
          α_{i,ŷ} ← α_{i,ŷ} − 1
      else if f(x_i, y_i; α) − f(x_i, ȳ; α) < γ then
        Handle margin violation:
          α_{i,y_i} ← α_{i,y_i} + 1
          α_{i,ȳ} ← α_{i,ȳ} − 1
      end if
    end for
  until a terminating criterion is met

Given a set of n training examples {(x_i, y_i)}, i = 1, ..., n, the algorithm attempts to find the vector w such that the decision-function values for the correct output and the best runner-up are separated by the user-defined margin γ:

wᵀφ(x_i, y_i) − max_{y ∈ Y∖{y_i}} wᵀφ(x_i, y) > γ

To make use of kernels, we assume that the weight vector w can be expressed as a linear combination over the training examples:

w = Σ_{j=1}^{n} Σ_{y′ ∈ Y} α_{j,y′} φ(x_j, y′)

This leads to a reparameterization of the decision function in terms of the α coefficients:

f(x, y; α) = Σ_{j=1}^{n} Σ_{y′ ∈ Y} α_{j,y′} K((x_j, y′), (x, y))

where K : (X × Y) × (X × Y) → R is the joint kernel defined over the input-output space. In this work, we take the joint kernel to be the product of an input-space kernel and an output-space kernel:

K((x, y), (x′, y′)) = K_X(x, x′) K_Y(y, y′)

For the output-space kernel, K_Y, we use a linear kernel; the input-space kernel is described below. The general routine for learning the coefficients α is presented in Algorithm 1. In our application, the terminating criterion is taken to be a limit on the number of iterations.

3. Experimental Results
We propose a loss function we call the kernel loss and argue for its use in hierarchical classification problems, since it generalizes the F-measure used in information retrieval (van Rijsbergen, 1979). Details will be provided elsewhere.
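Under the assumption of a product joint kernel over plain dot products, Algorithm 1 can be sketched as follows; the helper names, the linear kernels, and the epoch-count stopping rule are illustrative choices and not the paper's actual implementation:

```python
# Minimal dual structured-perceptron sketch (Algorithm 1), assuming the
# joint kernel K((x,y),(x',y')) = <x,x'> * <y,y'>. All names are ours.

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def f(x, y, alpha, X, Y_all):
    """f(x, y; alpha) = sum_j sum_y' alpha[j][y'] * K((x_j, y'), (x, y))."""
    return sum(alpha[j][yp] * dot(X[j], x) * dot(Y_all[yp], Y_all[y])
               for j in range(len(X)) for yp in Y_all)

def train(X, labels, Y_all, gamma=0.1, epochs=10):
    """X: input vectors; labels: correct macro-label ids;
    Y_all: dict mapping macro-label id -> micro-label vector."""
    alpha = [{y: 0.0 for y in Y_all} for _ in X]
    for _ in range(epochs):  # terminating criterion: iteration limit
        for i, x in enumerate(X):
            # Top two scoring labels.
            ranked = sorted(Y_all, key=lambda y: -f(x, y, alpha, X, Y_all))
            y_hat, y_bar = ranked[0], ranked[1]
            if y_hat != labels[i]:           # misclassification
                alpha[i][labels[i]] += 1
                alpha[i][y_hat] -= 1
            elif (f(x, labels[i], alpha, X, Y_all)
                  - f(x, y_bar, alpha, X, Y_all)) < gamma:
                alpha[i][labels[i]] += 1     # margin violation
                alpha[i][y_bar] -= 1
    return alpha
```

Prediction then follows the argmax inference rule: `max(Y_all, key=lambda y: f(x, y, alpha, X, Y_all))`.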
For an output-space kernel K_Y, the kernel loss is

ℓ(y, ŷ) = 1 − K_Y(y, ŷ) / √(K_Y(y, y) K_Y(ŷ, ŷ)) = 1 − yᵀŷ / √((yᵀy)(ŷᵀŷ))

where the second equality holds for the linear kernel. We used data from the following four species: C. elegans, D. melanogaster, S. cerevisiae and S. pombe. Our experiments followed the leave-one-species-out paradigm, where we withheld one species for testing and trained the perceptron on the remaining data, rotating which species was withheld. This variant of cross-validation simulates the situation of annotating a newly sequenced genome (Vinayagam et al., 2004). Prior to making predictions, we ran the data through several steps of preprocessing. First, we removed all annotations that were discovered through computational means, as these were generally inferred by sequence or structure similarity and would introduce bias into any classifier that used sequence similarity to make a prediction. Second, we expanded the set of annotations associated with a protein to include all ancestor nodes of the nodes it was annotated with; for simplicity we considered a subset of the GO hierarchy called GO-slims. We then ran BLAST for each of the proteins in our dataset against all four species, removing the hits where the protein was aligned to itself. We employed the nearest-neighbor BLAST methodology as our baseline: for every test protein, we transferred the annotations from the most significant hit against a protein from another species. Hits with e-values above 10⁻⁶ were not considered in our experiments. The structured-output perceptron is provided exactly the same data as the BLAST method. The input-space kernel is an empirical kernel map that uses the negative log of the BLAST e-values that are below 50, where the features were normalized to have values less than 1.0 and the input vectors were normalized to be unit vectors. The inference during training was limited to only those macro-labels that appear in the training dataset. We call this space Y_1.
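With a linear output kernel, the kernel loss reduces to one minus the cosine similarity between macro-label vectors; a minimal sketch (the helper name `kernel_loss` is ours):

```python
import math

def kernel_loss(y, y_hat):
    """Kernel loss for a linear output-space kernel:
    1 - (y . y_hat) / sqrt((y . y)(y_hat . y_hat)),
    where y and y_hat are 0/1 micro-label vectors."""
    num = sum(a * b for a, b in zip(y, y_hat))
    norm = math.sqrt(sum(a * a for a in y) * sum(b * b for b in y_hat))
    return 1.0 - num / norm
```

The loss is 0 for identical label sets, 1 for disjoint ones, and falls strictly between for partial overlap, which is what makes it a graded measure for hierarchical multi-label outputs.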
For inference of test-sample labels we considered three different output spaces, Y_1, Y_2, Y_3, in order to examine the effect of the size of the search space on prediction accuracy. We define Y_3(x) to be the set of macro-labels that appear in the significant BLAST hits of protein x (e-values below 10⁻⁶). Additionally, we define Y_2(x) to be the set of all macro-labels that can be obtained from the micro-labels in Y_3(x), with the constraint that each macro-label represents at most three leaf nodes of the hierarchy. These label spaces satisfy Y_3(x) ⊆ Y_2(x) ⊆ Y_1.

Table 1. Empirical results comparing the performance of the traditional transfer-of-annotation method to the structured-outputs approach. Presented is the mean kernel loss per protein, with standard deviations in parentheses. For comparison, we also include the performance of a random classifier that transfers annotation from a training example chosen uniformly at random.

  Method      Output space  C. elegans     D. melanogaster  S. cerevisiae  S. pombe
  BLAST NN    -             0.390 (0.258)  0.278 (0.264)    0.221 (0.252)  0.223 (0.240)
  Perceptron  Y_1           0.403 (0.254)  0.280 (0.262)    0.221 (0.242)  0.255 (0.243)
  Perceptron  Y_2           0.404 (0.260)  0.265 (0.271)    0.204 (0.244)  0.221 (0.243)
  Perceptron  Y_3           0.398 (0.264)  0.263 (0.271)    0.199 (0.242)  0.222 (0.243)
  Random      -             0.507 (0.217)  0.527 (0.208)    0.529 (0.200)  0.490 (0.217)

The results are presented in Table 1. When the output label space is limited to Y_2 or Y_3 during testing, the structured perceptron algorithm outperforms the BLAST nearest-neighbor classifier. The larger label space Y_1 results in the inference procedure considering annotations that are irrelevant to the actual function of the test protein, which reduces prediction accuracy. However, even in this case, the perceptron maintains competitive performance compared to the BLAST nearest-neighbor method. The results support our hypothesis that learning the structure of the output space is superior to simple transfer of annotations. Note that the classifiers performed poorly when testing on proteins from C. elegans. This is due to the fact that the vast majority of proteins in this species are annotated as protein binders (GOID: ).
Such annotations contain little information from a biological standpoint and result in a skewed set of output labels. However, removing the species or the micro-label from the analysis lowers prediction accuracy, suggesting that there is relevant information in the input-space features captured by the dataset. We have shown here that a structured-output method performs better than a nearest-neighbor method when provided with the same information. Our structured-output method can be enhanced in several ways to further boost its performance: additional information can easily be provided in the form of additional kernels on the input space that use other forms of genomic information (e.g., protein-protein interactions); the structured perceptron can be replaced with maximum-margin classifiers; and semi-supervised learning can be used to leverage the abundance of available sequence information. In future work we will also consider larger datasets that include a larger number of species.

References
Altschul, S., Gish, W., Miller, W., Myers, E., & Lipman, D. (1990). Basic local alignment search tool. J. Mol. Biol., 215.
Collins, M. (2002). Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, 1–8.
Galperin, M. Y., & Koonin, E. V. (1998). Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biology, 1.
Gene Ontology Consortium (2000). Gene Ontology: tool for the unification of biology. Nat. Genet., 25.
Rousu, J., Saunders, C., Szedmak, S., & Shawe-Taylor, J. (2006). Kernel-based learning of hierarchical multilabel classification models. The Journal of Machine Learning Research, 7.
Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005).
Large margin methods for structured and interdependent output variables. The Journal of Machine Learning Research, 6.
van Rijsbergen, C. (1979). Information Retrieval. London: Butterworths.
Vinayagam, A., König, R., Moormann, J., Schubert, F., Eils, R., Glatting, K.-H., & Suhai, S. (2004). Applying support vector machines for gene ontology based gene function prediction. BMC Bioinformatics, 5, 178.

A Taxonomy of Capabilities-Based DDoS Defense Architectures

Vamsi Kambhampati, Christos Papadopoulos, Dan Massey

Abstract
Distributed Denial of Service (DDoS) attacks pose an immense threat to the Internet. In this paper, we explore a new class of DDoS defense architectures called capabilities. These architectures advocate a fundamental change to the Internet architecture, so that senders must obtain permission from a receiver before they are allowed to send traffic. Our work re-examines the existing proposals on capability architectures from the ground up and identifies crucial challenges in building a capabilities-enabled Internet. To this end, we develop a taxonomy of capability architectures with the intent of better understanding the architectural changes, and the engineering trade-offs, involved in making this paradigm shift in the Internet.

Index Terms: DDoS, Capabilities

I. INTRODUCTION
The open nature of the Internet has allowed significant security threats in the form of DDoS attacks. In the Internet, any source can send traffic to any destination, without consent from the destination. A DDoS attacker exploits this openness by sending unwanted traffic from several compromised machines (botnets) distributed across the Internet, with the intent of exhausting the limited network and/or host resources available at a target. For example, in a typical bandwidth-exhaustion attack, shown in Figure 1, the attacker sends a large volume of unwanted traffic to a target, causing severe congestion at routers. Traffic from both the attacker and legitimate clients suffers packet loss. However, the attacker gains most of the bandwidth due to its large traffic volume, leaving less of the available bandwidth for legitimate traffic and thus causing denial of service to legitimate clients.
In the above scenario, although the destination may be capable of differentiating between unwanted (i.e., attack) and wanted (i.e., legitimate) traffic, it unfortunately does not have the ability to prioritize wanted traffic in the network. In contrast, routers have the ability to prioritize traffic, but have little knowledge of what the destination desires. The problem is that the Internet lacks distributed enforcement of the rules set forth by the destination. The DDoS problem has led to a number of proposed defenses in the literature [4], [2], [1], [5], [7], [8], [3]. Of these, a particularly interesting recent class of DDoS defenses is based on network capabilities [7], [8], [6], [3]. These proposals advocate a change to the Internet architecture so that senders must obtain permission (a capability) from the receiver before they are allowed to send traffic. Figure 2 shows an example of a capabilities architecture. A sender that wants to communicate sends an initial request packet to a receiver.

Fig. 1. Example of a bandwidth-exhaustion DDoS attack: attack hosts and a legitimate client send traffic through a congested router to the target, and packets are dropped.

Routers along the forwarding path insert pre-capabilities into these requests. Upon receiving the request, the destination synthesizes a host-capability from the pre-capabilities and sends it back to the sender. Capabilities use cryptographic techniques so that routers can easily verify their validity. Subsequent data packets from the sender must carry capabilities; otherwise routers drop the unauthorized packets. Moreover, capabilities typically expire after a while and need to be refreshed, enabling the receiver to reject senders that misbehave. From this description, we view capabilities as a means for the destination to express its will to the network. Using capabilities, the destination can reject unwanted traffic from the DDoS attacker.
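The request / pre-capability / host-capability exchange described above can be sketched as follows. This is an illustrative HMAC-based construction, not the wire format of any particular proposal; the key handling, field layout, and expiry policy are all invented for the example:

```python
# Hypothetical sketch of a capabilities exchange: a router stamps requests
# with pre-capabilities, the destination builds a host-capability, and the
# router later verifies data packets against its own stamp.
import hmac
import hashlib
import time

def pre_capability(router_key, src, dst, issued_at):
    """Router stamp bound to the flow (src, dst) and an issue time."""
    msg = f"{src}|{dst}|{int(issued_at)}".encode()
    return hmac.new(router_key, msg, hashlib.sha256).hexdigest()[:16]

def host_capability(pre_caps):
    """Destination synthesizes a host-capability from the routers' stamps."""
    return "|".join(pre_caps)

def verify(router_key, src, dst, issued_at, cap_field, lifetime=10):
    """Router re-derives its stamp and checks it appears in the packet's
    capability field; expired capabilities fail and must be refreshed."""
    if time.time() - issued_at > lifetime:
        return False
    expected = pre_capability(router_key, src, dst, issued_at)
    return expected in cap_field.split("|")
```

Because each router can recompute its own stamp from its secret key, verification is stateless at the router, which is the property that makes this style of design attractive at high packet rates.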
However, despite the fact that many point solutions have been proposed, there has not been a rigorous study of the entire solution space for capability architectures. Our work aims to remedy this by re-examining capability architectures from the ground up. We believe adding a destination's will to the Internet is a major change that will inevitably introduce engineering trade-offs in effectiveness and deployability. To identify and understand these trade-offs, we begin by building a taxonomy to categorize possible options and map the potential solution space. Our taxonomy identifies the key components of a capabilities architecture, sets apart implementation specifics from fundamental requirements, and opens up challenging questions for building a capabilities-enabled Internet. Through the taxonomy, we lay a foundation for determining whether capabilities should be included as a fundamental part of the Internet and, if so, for setting the direction for building effective and deployable capability architectures. The rest of the paper describes our taxonomy (Section II). We conclude with a discussion of open challenges identified by our taxonomy and future work in Section III.

Fig. 2. Network model of a capabilities architecture: sources reach the destination through a source border router (SBR), core routers, and a destination border router (DBR); requests collect pre-capabilities on the forward path, and the response carries host-capabilities back to the sender.

Fig. 3. Top-level components of our taxonomy: traffic classification, enforcement, and capability management.

Fig. 4. Sub-components of the taxonomy: (a) traffic classification (decision and marking), (b) enforcement, (c) capability management (setup and maintenance).

II. A TAXONOMY OF CAPABILITY ARCHITECTURES

We start our discussion with the description of a simple network model that shows the important components of a network utilizing capabilities; Figure 2 shows this model. First, our network model shows several sources, any of which could be a DDoS attacker or a legitimate client. We assume the destination is the victim of a DDoS attack. Second, we assume the network is distributed (similar to the Internet), with different administrative authorities controlling different parts of the network. Note that agreement on a common policy in a distributed network is difficult.
Third, our model shows several important locations: the source and destination; a source border router (SBR) and a destination border router (DBR), which act as the administrative boundaries between the source, the destination, and the rest of the network; and core routers, which provide transit service for packets. Equipped with the network model, we now discuss the top-level components of our capabilities taxonomy. As noted earlier, we view capabilities as a means to push the will of the destination into the network. To this end, our taxonomy identifies three top-level categories, namely: i) traffic classification, which describes the decision and marking of packets into different traffic categories (the traffic categories capture the will of the destination, and marking conveys that will to the network); ii) enforcement, which covers implementing the traffic policies (i.e., the will) specified by the destination on the traffic categories; and iii) capability management, which deals with setup and maintenance of capabilities. Figure 3 shows the three top-level categories. In general, our taxonomy asks important questions regarding these categories, such as: i) who is involved in deciding, enforcing, and managing; ii) where; iii) how; and iv) when to decide, manage, and enforce. The rest of this section explores these questions and describes the subcategories in detail.

A. Traffic Classification
The first step in capability architectures is to classify traffic into various traffic categories. This process involves deciding on the different traffic classes and subsequently marking the packets. Figure 4(a) shows the taxonomy of traffic classification.
1) Decision: The decision step tries to capture the destination's will.
More specifically, this will relates to three traffic classes: i) wanted, the traffic that the destination desires to receive; ii) unwanted, the traffic that may be dropped; and iii) unclassified, the traffic that the destination is uncertain about (for example, request packets). Apart from the destination, a DBR is also capable of making decisions, since it has knowledge about the resources under attack. Beyond these two locations, however, other locations do not have sufficient knowledge to make decisions. Deciding which packet belongs to which category is a matter of local policy at the decider. Finally, the decision can be made at all times, even when there is no attack; on demand, upon request from some other entity; or only when the network is under attack.
2) Marking: Once the decision is made, the network needs to mark packets with their respective traffic categories, since without marking the destination has no way of expressing its desire. Marking involves adding the bits needed to carry capabilities to the packet header. Conceivably, marking can be done on individual packets or on flows. We omit a specific flow definition, but allow any (or all) fields of a packet to define a flow. Marking can take place at any of the five locations described in our network model. Note that this model allows partial deployment, where a source may not have the necessary changes to mark, but an SBR can mark instead. Wanted traffic is marked using valid capabilities. Unwanted and unclassified packets do not carry capabilities, but are differentiated using other means.

B. Enforcement
The goal of capabilities is to give the destination explicit control over what it wants and does not want to receive.

Enforcement fulfills this requirement by taking appropriate action on the traffic classes. Figure 4(b) shows the enforcement process. Enforcement, like marking, takes place on individual packets or on flows. All locations except the source could enforce; the source is not a good choice, since it could be a DDoS attacker. Under enforcement, packets (or flows) are either dropped or allowed to pass. Specifically, the enforcer verifies the capabilities included in packets (or flows) and, if verification succeeds, allows the traffic to pass. Unwanted and unclassified traffic does not carry capabilities, so it always fails verification. However, the enforcer may either drop such traffic or allow it to pass, depending on its local choice (enforcement consumes resources, which the enforcer may be trying to save). Having multiple enforcement locations is important in this case.

C. Capability Management
Delivering the will of the destination to the network requires establishing capabilities (i.e., state) between the decider, marker, and enforcer. Capability management deals with the communication involved in establishing this state, and with updating the state when necessary. We identify two steps under capability management: 1) setup and 2) maintenance. Figure 4(c) shows the capability management taxonomy.
1) Setup: Setup answers the question of how to get the state at the decider into the network. However, the decider does not know who the markers (or senders) might be. The setup process is thus responsible for discovering these nodes and establishing capabilities between the decider and the marker. Capabilities established during setup should not last forever; otherwise, the destination has no control over misbehaving senders (a sender may start out well behaved but later misbehave). There are two choices for a decider: either explicitly revoke capabilities, or implicitly allow them to time out.
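The two revocation choices can be illustrated with a small soft-state capability table; the class and field names here are hypothetical, invented only to contrast explicit revocation with timeout-based expiry:

```python
# Hypothetical soft-state capability table: entries either time out on
# their own (soft state) or are removed by explicit revocation.
import time

class CapabilityTable:
    def __init__(self, lifetime=30.0):
        self.lifetime = lifetime
        self.entries = {}  # capability -> expiry timestamp

    def grant(self, cap, now=None):
        """Install or refresh a capability; refreshing re-extends expiry."""
        now = time.time() if now is None else now
        self.entries[cap] = now + self.lifetime

    def revoke(self, cap):
        """Explicit revocation by the decider."""
        self.entries.pop(cap, None)

    def valid(self, cap, now=None):
        """Unrefreshed entries silently expire after `lifetime` seconds."""
        now = time.time() if now is None else now
        return self.entries.get(cap, 0) > now
```

The soft-state choice keeps the enforcer simple: no revocation messages need to traverse the network, at the cost of requiring well-behaved senders to refresh periodically.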
The latter choice matches the Internet model of soft-state signaling, and we imagine most capability architectures will follow this model. Unlike decision, marking, and enforcement, a specific location choice for setup (and maintenance) does not matter, since setup is always between two entities (for example, the decider and a marker). How to set up depends on the communication model in use: it could be pull-based, where the marker requests capabilities, or push-based, where the decider pre-establishes a few capabilities with potential markers.
2) Maintenance: Since capabilities established during setup expire, a maintenance phase is required to re-establish capabilities between the decider and marker without going through setup again. Moreover, maintenance is required to handle changes in the decision process at a decider, and path changes that occur during the communication between a source and destination. Maintenance is thus responsible for refreshing the state (i.e., the capabilities) between a decider and a marker. As with setup, a specific location for maintenance does not matter; however, refreshing state requires understanding the requirements at each of the decider, marker, and enforcer.

III. CONCLUSIONS AND FUTURE WORK
In this paper, we tackle an important security problem facing the Internet: DDoS attacks. Specifically, we investigate a new class of DDoS defense architectures called capabilities. These architectures suggest a paradigm shift from the traditional Internet communication model to one that disallows a source from sending traffic without the consent of the destination. We develop a taxonomy that lays out a systematic solution space, to better understand the broader design challenges and fundamental trade-offs associated with capability architectures. We believe our taxonomy motivates future research into capability architectures and acts as a tool to evaluate future proposals in comparison with existing work on capabilities.
To our knowledge, previous proposals on capabilities neither brought out the concept of deciders, markers, and enforcers, nor identified the essential components needed to build a capability architecture. Our work clearly defines these players and identifies the three essential categories that make up a capability architecture. Moreover, we bring out open challenges that need more attention. For example, determining who is responsible for enforcement directly impacts overall effectiveness, the cost of deployment, and the state requirements at routers. In addition, we identify challenges with respect to where to locate state and to what extent, and how to manage and refresh that state. These questions have implications in terms of trade-offs between the damage caused (security) versus the choice of location, and the overhead imposed by capabilities. We intend to evaluate existing proposals in the context of our taxonomy and show how they fit into it. We also intend to explore solutions to the open challenges described above. Specifically, we are looking into capability architectures that leverage border routers to reduce overhead and deployment costs.

REFERENCES

[1] A. D. Keromytis, V. Misra, and D. Rubenstein. SOS: Secure Overlay Services. In ACM SIGCOMM, pages 61-72, August.
[2] R. Mahajan, S. M. Bellovin, S. Floyd, J. Ioannidis, V. Paxson, and S. Shenker. Controlling high bandwidth aggregates in the network. SIGCOMM Computer Communications Review (CCR), 32(3):62-73, July.
[3] B. Parno, D. Wendlandt, E. Shi, A. Perrig, B. Maggs, and Y.-C. Hu. Portcullis: Protecting Connection Setup from Denial-of-Capability Attacks. In ACM SIGCOMM, August.
[4] S. Savage, D. Wetherall, A. Karlin, and T. Anderson. Practical Network Support for IP Traceback. In ACM SIGCOMM, August.
[5] M. Walfish, M. Vutukuru, H. Balakrishnan, D. Karger, and S. Shenker. DDoS Defense by Offense. In ACM SIGCOMM, August.
[6] L. Wang, Q. Wu, and D. D. Luong.
Engaging Edge Networks in Preventing and Mitigating Undesirable Network Traffic. In Workshop on Secure Network Protocols (NPSEC), October.
[7] A. Yaar, A. Perrig, and D. Song. SIFF: A Stateless Internet Flow Filter to Mitigate DDoS Flooding Attacks. In IEEE Symposium on Security and Privacy, May.
[8] X. Yang, D. Wetherall, and T. Anderson. A DoS-limiting Network Architecture. In ACM SIGCOMM, August 2005.

CS DEPARTMENT RESEARCH SYMPOSIUM

Classifier Bias in Protein Function Prediction

Mark F. Rogers and Asa Ben-Hur

Abstract Annotating proteins with their functions is critical for biologists who wish to isolate promising proteins for research from vast protein databases. Laboratory experiments yield the most accurate annotations, but are costly and time-consuming. Automated annotation algorithms are thus attractive alternatives, even at the cost of reduced accuracy, and many of the annotations in protein databases are therefore based on computational predictions. In view of the importance biologists place on good annotations, the development of novel methods for protein function prediction is an active area of research in bioinformatics. Many researchers assess their methods' accuracy without reference to the source of the annotations they use to train their models. This can lead to over-optimistic results: given that many existing annotations are based on sequence or structural similarity, it is no surprise that a classifier that uses such information can predict these annotations with high accuracy. We illustrate this phenomenon in a set of controlled experiments, using a simple nearest-neighbor classifier to make predictions based on PSI-BLAST similarity scores.

I. INTRODUCTION

Biologists rely extensively on annotations in protein databases when conducting their research. Protein annotations provide information such as a protein's molecular functions, the processes in which it participates, and the cellular locations where it is found. The most accurate annotations come from laboratory experiments, which are often time-consuming and expensive. The overwhelming amount of protein sequence data that genome sequencing generates makes it impossible to annotate all newly sequenced genomes experimentally.
To obtain annotations more quickly and cheaply, researchers turn to automated tools such as the Basic Local Alignment Search Tool (BLAST) and its successors [1], [2]. In addition, machine learning researchers have developed a variety of classifiers that annotate proteins based on data such as BLAST scores, protein interaction data, and microarray gene expression data (e.g., see [3], [4], [5]). To capture key protein characteristics from a variety of annotation methods, biologists developed the Gene Ontology (GO) [6]. The ontology comprises three hierarchical namespaces containing 22,000 terms that describe different protein characteristics: the function a protein performs at the molecular level, the cell compartments where it resides, and the biological processes in which it participates. The Gene Ontology also provides evidence codes that permit researchers to characterize the method used to ascribe an annotation to a protein. The evidence codes distinguish, for example, between annotations based on laboratory experiments and computational predictions. Most researchers who design protein function classifiers ignore evidence codes when they design computational experiments to assess their classifiers' accuracy. It is no surprise that their methods perform well in predicting annotations that were derived by computational methods, since the classifiers are usually based on similar information, such as sequence or structural similarity. In this study we demonstrate the impact of this potential bias to alert other researchers to this pitfall.

(M. F. Rogers and A. Ben-Hur are with the Department of Computer Science, Colorado State University, Ft. Collins, CO; rogersma@cs.colostate.edu, asa@cs.colostate.edu.)

II. METHODS

Our experiments use a simple nearest-neighbor classifier that uses PSI-BLAST to measure protein similarity.
A protein in the test set was characterized by its PSI-BLAST similarity scores with proteins in the training set. To predict a protein's annotation, the classifier found the protein in the training set with the highest similarity score and transferred its annotation. We assessed our classifier using the Leave-One-Species-Out (LOSO) cross-validation methodology, in which each cross-validation fold consists of the proteins of a single species. This mimics the scenario of having to annotate a newly sequenced organism.

A. Selecting Evidence Codes

For our experiments, we wanted to distinguish between evidence codes that might impart classifier bias and those that should not. A code's category depended on whether it used the same kind of information as our classifier. We divided GO evidence codes into three categories: nonbiasing, biasing, and unusable. For our classifier, biasing codes indicate that an annotation may have been derived using sequence or structural similarity; the GO contains three codes that may indicate reliance on sequence similarity (see Table I). Nonbiasing codes indicate that an annotation was based on a laboratory experiment (assay).

TABLE I
GO EVIDENCE CODES, THEIR MEANINGS, AND THEIR CLASSIFICATIONS.
N = Nonbiasing, B = Biasing, U = Unusable.

  Code  Description                         Class
  IDA   wet-lab assay                       N
  IEP   wet-lab assay                       N
  IGC   genomic context                     N
  IGI   wet-lab assay                       N
  IMP   wet-lab assay                       N
  IPI   wet-lab assay                       N
  TAS   expert opinion                      N
  IEA   electronic annotation               B
  ISS   sequence or structural similarity   B
  RCA   computational analysis              B
  NAS   non-traceable information           U
  ND    no data available                   U
  IC    inferred by curator                 N/B/U

We identified two evidence codes as unusable for our experiments; these describe instances where the annotation's information source is unknown, so we removed these terms from our experiments. Some annotations are Inferred by Curator (IC), which means that an expert used one annotation as evidence for another. For example, if a GO term shows that a protein participates in histone acetylation, a curator may add another term to show that it resides in the nucleus. However, these decisions may be based on annotations that come from sequence similarity or other electronic means. To account for this possibility, we assigned to each IC-coded term the evidence code of its reference annotation.

B. Experiments

We wanted to illustrate how bias can artificially raise the number of correct annotations for a classifier. To measure this effect, we observed our classifier's performance with and without biasing codes in the training set and noted how the change impacted the classifier's statistics. We performed leave-one-species-out (LOSO) testing, in which we omitted one species from the training data and used the training annotations to make predictions for the omitted species. For each species we conducted two tests to capture performance statistics: one for nonbiasing terms and a second for biasing terms. In the first test we included only nonbiasing annotations in the training data.
In the second test we expanded the training annotations to include biasing terms as well as nonbiasing terms. This allowed us to use the same set of proteins for training in both the nonbiasing and biasing paradigms, but it introduced another potential source of bias. In general, a training set with well-characterized proteins (proteins with many terms) will yield better predictions than one with poorly characterized proteins. Thus we could improve our classifier's performance merely by adding terms to proteins in the training set. To address this issue, we had to ensure that annotations in the training set had the same number of terms in both the nonbiasing and biasing tests. For the nonbiasing training data, we simply selected each protein's nonbiasing terms. To create the biasing training data, we randomly selected biasing and nonbiasing terms from each protein's full annotation set. In our experiments we focused on the GO molecular function namespace. To ensure there would be enough data for training and testing, we used the reduced "slims" ontology, which contains 42 terms, instead of the full ontology of 22,000 terms. Our classifier predicted multiple labels for each protein, so to compare performance across training sets we computed balanced accuracy scores for each GO term and combined them to compute overall classifier accuracy. When the classifier returned a predicted annotation for a protein, we compared two annotation sets: the actual set, A, and the predicted set, P. If we denote the set of GO terms G, then for every GO term t ∈ G we updated the counts of true positives (TP_t), true negatives (TN_t), false positives (FP_t), and false negatives (FN_t) for each prediction, using the following rules:

  TP_t = TP_t + I(t ∈ P ∧ t ∈ A)
  TN_t = TN_t + I(t ∉ P ∧ t ∉ A)
  FP_t = FP_t + I(t ∈ P ∧ t ∉ A)
  FN_t = FN_t + I(t ∉ P ∧ t ∈ A)

(Here I(·) is the indicator function.)
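A minimal sketch (toy annotation sets and our own code, not the authors' implementation) of this per-term bookkeeping, together with the balanced-accuracy summary defined next (the mean of per-term sensitivity and specificity, averaged over terms):

```python
# Toy sketch of the per-term TP/TN/FP/FN bookkeeping: update counts for
# each GO term from the predicted set P and actual set A, then summarize
# each term by balanced accuracy and average over terms.
G = {"t1", "t2", "t3"}                              # toy GO term set
counts = {t: {"TP": 0, "TN": 0, "FP": 0, "FN": 0} for t in G}

def update(P, A):
    """Apply the four indicator-function update rules for every term."""
    for t in G:
        if t in P and t in A:
            counts[t]["TP"] += 1
        elif t not in P and t not in A:
            counts[t]["TN"] += 1
        elif t in P:
            counts[t]["FP"] += 1      # predicted but not actual
        else:
            counts[t]["FN"] += 1      # actual but not predicted

def balanced_accuracy(c):
    """0.5 * (sensitivity + specificity); degenerate ratios count as 1."""
    sens = c["TP"] / (c["TP"] + c["FN"]) if c["TP"] + c["FN"] else 1.0
    spec = c["TN"] / (c["TN"] + c["FP"]) if c["TN"] + c["FP"] else 1.0
    return 0.5 * (sens + spec)

update(P={"t1"}, A={"t1"})            # a perfect prediction
update(P={"t2"}, A={"t3"})            # one false positive, one false negative

mu_B = sum(balanced_accuracy(counts[t]) for t in G) / len(G)
```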
With these values, we could then compute the balanced accuracy B_t for a classifier relative to term t:

  B_t = 1 - (1/2) [ FN_t / (TP_t + FN_t) + FP_t / (TN_t + FP_t) ]    (1)

We then computed the mean classification accuracy µ_B for a classifier using:

  µ_B = (1 / |G|) Σ_{t ∈ G} B_t    (2)

C. Species Selection

We selected species that had a large number of annotated proteins in the molecular function namespace, to ensure that each GO term would have enough data for training and testing. We also wanted to select species that had enough annotations with nonbiasing terms that we could

ensure consistency in our training sets. Thus we selected five widely studied organisms for our experiments: three yeast species (C. albicans, S. cerevisiae, and S. pombe), the fruit fly (D. melanogaster), and the nematode C. elegans.

III. RESULTS

TABLE II
STATISTICAL SIGNIFICANCE OF THE DIFFERENCE BETWEEN BIASING AND NONBIASING CODES IN A LOSO EXPERIMENT ON FIVE SPECIES. Wilcoxon signed-rank p-values with mean and median balanced accuracy (nonbiasing and biasing) at τ = 6 for CA, SP, CE, DM, SC, and all species combined.

Our results show that using biasing codes significantly improved classifier accuracy. D. melanogaster and S. cerevisiae are the two species with the largest number of annotations. For D. melanogaster, the mean classification accuracy increased to 0.736, and S. cerevisiae also showed a significant increase (see Table II). We observed slight increases for C. elegans as well, but these changes were not statistically significant. Overall balanced accuracy increased as well. Figure 1 presents graphs that compare individual GO term performance for biasing and nonbiasing codes in S. cerevisiae and D. melanogaster. The influence of bias is most pronounced for D. melanogaster, where the classifier performed better with biasing terms than with nonbiasing terms in nearly every case.

IV. DISCUSSION

We have presented a set of experiments that demonstrate how bias may influence the accuracy of predicting protein function. For two of the five species studied, the effect was pronounced and statistically significant. These results suggest that if researchers do not take this kind of bias into account, they may report statistics that artificially inflate a model's performance. These results also have implications for experiments that compare different algorithms.
If two protein annotation methods leverage different kinds of information, they may have different sets of nonbiasing and biasing terms, making a comparison meaningless. A statistically valid experiment should then use only the nonbiasing terms common to both models, further restricting the amount of annotated data available for testing. If the models' nonbiasing terms do not overlap, more work may be required to eliminate bias, or a valid comparison may not be possible.

Fig. 1. Bias impacts classifier accuracy for D. melanogaster and S. cerevisiae. Each point represents a GO term; points above the diagonal line show that accuracy was higher for biasing terms than for nonbiasing terms.

REFERENCES

[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, "Basic local alignment search tool," J Mol Biol, vol. 215, no. 3, October.
[2] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, vol. 25.
[3] Y. Zhou, G.M. Huang, and L. Wei, "UniBLAST: a system to filter, cluster, and display BLAST results and assign unique gene annotation," Bioinformatics, vol. 18, no. 9, pp. 1268.
[4] S. Letovsky and S. Kasif, "Predicting protein function from protein/protein interaction data: a probabilistic approach," Bioinformatics, vol. 19 (Suppl. 1), pp. i197-i204.
[5] M. Deng, T. Chen, and F. Sun, "An integrated probabilistic model for functional prediction of proteins," in RECOMB '03: Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology, Berlin, Germany, 2003, ACM Press.
[6] M. Ashburner, C.A. Ball, J.A. Blake, H. Butler, J.M. Cherry, J. Corradi, K. Dolinski, J.T. Eppig, M. Harris, D.P. Hill, et al., "Creating the gene ontology resource: design and implementation," Genome Res, vol. 11, no. 8, 2001.

Predicting Number of Incidents Exploiting a Vulnerability

Sudip Chakraborty and Yashwant K. Malaiya
Computer Science Department, Colorado State University
{sudip, malaiya}@cs.colostate.edu

Abstract One way to understand and deploy preventive security mechanisms, whereby a system has the ability to protect itself from external attack, is to gain insight into the attack or intrusion process. A good metric for understanding the attack process is the number of attack incidents. In this paper we present a mathematical framework to model the temporal distribution of attack incidents related to a vulnerability. The model is based on the number of observed incidents related to the vulnerability. Such a model helps to estimate the number of exploit incidents in the future, thereby giving us some idea of the extent of the preventive mechanisms that need to be taken.

I. INTRODUCTION

Formalizing a system's security attributes in quantitative terms, or defining quantitative measures of system security, is one of the recent trends in security research. Factors like the number of known vulnerabilities, the number of residual vulnerabilities of the system, etc., have been considered for this purpose [1]. However, the system's owner is responsible for controlling these factors to achieve a certain security level. Alternatively, there is another approach to defining security attributes: quantifying factors related to the attacker's perspective. For example, the time spent to launch an attack against a certain system; the effort, in terms of resources, time, money, manpower, or a combination of these, involved in an attack; and the incentive for an attack are some of the factors related to the attacker. To develop a quantitative measure of security by this second approach we need information regarding the aforesaid factors, which is difficult to obtain. Therefore, from the owner's side, an indirect way to learn about the attacker's intentions and the other factors is to study the exploitations or intrusions.
By analyzing the number of exploitations and the rate of exploitation, we can get an idea of the security level of the system against which the exploitations are made. One way to predict the future number of exploitations is to model the rate of exploitation with a mathematical framework. This framework, developed using existing data, would represent the actual exploitation process. Using the model as a predictor, we can estimate the number of incidents that are going to happen, which would help us prepare the necessary preventive actions. In this work we propose a mathematical framework to model the rate of exploitation incidents related to a specific vulnerability (in Phf). Our objective is to estimate the future number of incidents exploiting this vulnerability. We process the data on the number of incidents related to the Phf vulnerability to extract the trend, then fit a three-parameter mathematical model to the trend data. A statistical analysis is done to assess the degree of fit of the model to the data.

II. APPROACH

Our approach is motivated by that of [2], where the authors used the three vulnerabilities with the highest incidence rates, namely Phf, IMAP, and BIND. In this work we consider only the Phf vulnerability. Phf is the name of a common gateway interface (CGI) program. The purpose of the Phf program is to provide a web-based interface to a database of information, usually personnel information such as names, addresses, and telephone numbers. The vulnerability exploited in Phf was an implementation error. The Phf script works by constructing a command-line string based on input from the user. While the script attempted to filter the user's input to prevent the execution of arbitrary commands, its authors failed to filter the newline character. As a result, attackers could execute arbitrary commands on the web server at the privilege level of the HTTP server daemon, usually root [3] (Browne et al. [2]).
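The filtering flaw described above can be illustrated with a small Python sketch (our own reconstruction for illustration only; the real Phf was CGI code invoking a shell, and the identifiers and blocklist here are hypothetical):

```python
# Illustrative reconstruction (hypothetical names; the real Phf was CGI
# code invoking a shell) of why a character filter that misses the
# newline character enables command injection.
BLOCKED = set(";|&<>`$")          # hypothetical blocklist: "\n" is absent

def build_command(user_input: str) -> str:
    """Build a command-line string the way a naive CGI script might:
    reject obvious shell metacharacters, then splice the input in."""
    if any(ch in BLOCKED for ch in user_input):
        raise ValueError("input rejected")
    return "ph query=" + user_input

cmd = build_command("name=smith")          # a benign query passes the filter

# A malicious query also passes, because "\n" is not blocked; a shell
# treats the newline as a command separator, so the second line would run
# as an arbitrary command (here we only build the string, never run it).
evil = build_command("name=smith\nrm -rf /")
assert "\n" in evil
```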
The data on the number of Phf exploits in each month are obtained from the exploitation histogram (shown in Figure 1(a)) presented in [4]. We generate the cumulative number of exploits over the months, as shown in Table I (due to space restrictions, we show only a fragment of the table). A plot is generated from the cumulative count data. The temporal distribution of the events is analyzed from the plot, and a mathematical framework is proposed to model the trend in the number of exploit incidents. A curve corresponding to the equation of the model is fitted to the cumulative count plot, and a statistical analysis is done to measure the goodness of fit.

Fig. 1. Phf incidents: (a) Exploitation histogram (b) Cumulative count plot.

TABLE I
PHF INCIDENTS: month index, month, number of exploits per month, and cumulative count (fragment).

III. MODEL AND ANALYSIS

In this section we propose a mathematical framework to model the rate of incidents related to the Phf vulnerability. The cumulative counts for the Phf data are plotted against the month indices; the plot is shown in Figure 1(b). Observing the plot, we see three phases or sections in the trend: a starting phase, a growing phase, and a saturation phase. This type of three-phase behavior is also observed in [5] and [6], though with different interpretations of the phases. Therefore, to model the exploitation rate, it is reasonable to use a curve with three parameters controlling these three phases. Hence, we use an S-shaped (sigmoid) curve with three parameters P, Q, R (shown in Figure 2(a)). The parameter P controls the height of the S, i.e., the height of the saturation level; Q controls the slope of the S, i.e., how steep or gradual the growth is; and R specifies the starting point of the growth, i.e., how soon growth begins after the first incident. The mathematical equation of this S-shaped curve is:

  c = P / (1 + P Q e^(-P R m))    (1)

where c is the cumulative count and m is the month index. This equation is the proposed mathematical framework to model the rate of incidents for the Phf vulnerability. The next subsection discusses the goodness of fit of the model to the data.

A. Analysis

To fit the model to the data, the values of the parameters P, Q, and R need to be set.
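The fitting procedure can be sketched as follows (a minimal illustration with synthetic data and our own grid-search loop, not the authors' code): P is fixed at the observed saturation level, and Q and R are tuned by minimizing the least-squares sum.

```python
# Sketch of fitting the three-parameter S-curve c = P / (1 + P*Q*e^(-P*R*m)):
# P is fixed at the observed saturation level, and Q, R are tuned by a
# least-squares grid search over synthetic cumulative counts.
import math

def s_curve(m, P, Q, R):
    """Cumulative incident count at month index m."""
    return P / (1.0 + P * Q * math.exp(-P * R * m))

# Synthetic cumulative counts from known parameters, so the search has a
# recoverable answer (real inputs would come from the Table I counts).
P_true, Q_true, R_true = 100.0, 2.0, 0.005
months = list(range(1, 25))
data = [s_curve(m, P_true, Q_true, R_true) for m in months]

P = 100.0                         # fixed: the observed saturation value
best_QR, best_sse = None, float("inf")
for i in range(40):               # coarse grid over Q
    for j in range(20):           # and over R
        Q, R = 0.5 + 0.1 * i, 0.001 + 0.0005 * j
        sse = sum((s_curve(m, P, Q, R) - c) ** 2 for m, c in zip(months, data))
        if sse < best_sse:
            best_QR, best_sse = (Q, R), sse

Q_hat, R_hat = best_QR            # recovers Q ≈ 2.0, R ≈ 0.005 here
```

In practice a gradient-based least-squares routine would replace the grid search; the sketch only illustrates the "fix P, tune Q and R by least squares" procedure from the text.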
We set the value of P to the total cumulative count observed, i.e., the saturation value of the cumulative count. Initial values are guessed for the other two parameters, which are then tuned, keeping the value of P fixed, until a good fit is observed (using the least-squares sum as the criterion). From Figure 2(b) we see that the model fits the data well. To measure the goodness of fit, we calculate the estimated cumulative count using the above equation and perform a χ² independence test and an R² analysis.

Fig. 2. Phf incidents: (a) The three parameters P, Q, R (b) Proposed model fitted to the cumulative count data.

TABLE II
SUMMARY OF THE S-CURVE FIT TO THE DATA: fitted values of P, Q, and R with the χ² and R² statistics for the cumulative count.

The χ² test returns the probability of independence of the two data series, observed and estimated. If a curve is a good fit to the data, this probability should be small: the closer the value is to 0, the better the fit. R², known as the coefficient of determination, describes the proportion of the observed variation in the count that can be explained by time; the closer this value is to 1, the better the fit. The χ² test yields an extremely low probability of independence between the observed and estimated data series, which demonstrates the goodness of the fit, and this is corroborated by the R² value. Table II summarizes the data related to the model and its goodness of fit.

IV. CONCLUSION & FUTURE WORK

The number of attack incidents for a vulnerability gives a good measure of the threat or risk involved in that vulnerability. To take the necessary preventive measures against attacks on vulnerabilities, it helps to have some idea of the number of incidents that might happen in the future. By modeling the rate of incidents observed so far, we can estimate the approximate number of future incidents. In this work we propose a three-parameter equation to model the rate of incidents related to the Phf vulnerability. The parameters represent the three phases of the temporal distribution of the incidents. Statistical analysis of the goodness of fit reveals that the model can explain the rate of incidents well. Such an estimator can help an organization plan its preventive resources accordingly, thereby enhancing the performance of its risk management system. Much work remains to be done.
Presently the model is based on a small amount of data related to the Phf vulnerability only; data related to other vulnerabilities need to be considered. Another important aspect is how to set the values of the parameters in the context of the cumulative exploitation count; at present, the setting of only one parameter's value (P) has been explained. Last but not least, other mathematical models need to be considered, together with a comparative analysis of their goodness of fit.

REFERENCES

[1] E. Rescorla, "Is Finding Security Holes a Good Idea?" IEEE Security & Privacy, vol. 3, no. 1, Jan.-Feb.
[2] H. K. Browne, J. McHugh, W. A. Arbaugh, and W. L. Fithen, "A trend analysis of exploitation," Department of Computer Science, University of Maryland-College Park, Tech. Rep. CS-TR-4200, November.
[3] CERT Advisory, "Vulnerability in NCSA/Apache CGI example code," March.
[4] W. A. Arbaugh, W. L. Fithen, and J. McHugh, "Windows of vulnerability: A case study analysis," IEEE Computer, vol. 33, no. 12, December.
[5] O. H. Alhazmi and Y. K. Malaiya, "Quantitative vulnerability assessment of systems software," in Proceedings of the Annual IEEE Reliability and Maintainability Symposium, January 2005.
[6] E. Jonsson and T. Olovsson, "A quantitative model of the security intrusion process based on attacker behavior," IEEE Transactions on Software Engineering, vol. 23, no. 4, April 1997.


More information

Bagging and Boosting Algorithms for Support Vector Machine Classifiers

Bagging and Boosting Algorithms for Support Vector Machine Classifiers Bagging and Boosting Algorithms for Support Vector Machine Classifiers Noritaka SHIGEI and Hiromi MIYAJIMA Dept. of Electrical and Electronics Engineering, Kagoshima University 1-21-40, Korimoto, Kagoshima

More information

A Propagation Engine for GCC

A Propagation Engine for GCC A Propagation Engine for GCC Diego Novillo Red Hat Canada dnovillo@redhat.com May 1, 2005 Abstract Several analyses and transformations work by propagating known values and attributes throughout the program.

More information

Taccumulation of the social network data has raised

Taccumulation of the social network data has raised International Journal of Advanced Research in Social Sciences, Environmental Studies & Technology Hard Print: 2536-6505 Online: 2536-6513 September, 2016 Vol. 2, No. 1 Review Social Network Analysis and

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Seismic regionalization based on an artificial neural network

Seismic regionalization based on an artificial neural network Seismic regionalization based on an artificial neural network *Jaime García-Pérez 1) and René Riaño 2) 1), 2) Instituto de Ingeniería, UNAM, CU, Coyoacán, México D.F., 014510, Mexico 1) jgap@pumas.ii.unam.mx

More information

Optimally-balanced Hash Tree Generation in Ad Hoc Networks

Optimally-balanced Hash Tree Generation in Ad Hoc Networks African Journal of Information and Communication Technology, Vol. 6, No., September Optimally-balanced Hash Tree Generation in Ad Hoc Networks V. R. Ghorpade, Y. V. Joshi and R. R. Manthalkar. Kolhapur

More information

Spoofing Detection in Wireless Networks

Spoofing Detection in Wireless Networks RESEARCH ARTICLE OPEN ACCESS Spoofing Detection in Wireless Networks S.Manikandan 1,C.Murugesh 2 1 PG Scholar, Department of CSE, National College of Engineering, India.mkmanikndn86@gmail.com 2 Associate

More information

Dynamic Neighbor Positioning In Manet with Protection against Adversarial Attacks

Dynamic Neighbor Positioning In Manet with Protection against Adversarial Attacks International Journal of Computational Engineering Research Vol, 03 Issue, 4 Dynamic Neighbor Positioning In Manet with Protection against Adversarial Attacks 1, K. Priyadharshini, 2, V. Kathiravan, 3,

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Filtering Bug Reports for Fix-Time Analysis

Filtering Bug Reports for Fix-Time Analysis Filtering Bug Reports for Fix-Time Analysis Ahmed Lamkanfi, Serge Demeyer LORE - Lab On Reengineering University of Antwerp, Belgium Abstract Several studies have experimented with data mining algorithms

More information

A Security Management Scheme Using a Novel Computational Reputation Model for Wireless and Mobile Ad hoc Networks

A Security Management Scheme Using a Novel Computational Reputation Model for Wireless and Mobile Ad hoc Networks 5th ACM Workshop on Performance Evaluation of Wireless Ad Hoc, Sensor, and Ubiquitous Networks (PE-WASUN) A Security Management Scheme Using a Novel Computational Reputation Model for Wireless and Mobile

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

Access Control (slides based Ch. 4 Gollmann)

Access Control (slides based Ch. 4 Gollmann) Access Control (slides based Ch. 4 Gollmann) Preliminary Remarks Computer systems and their use have changed over the last three decades. Traditional multi-user systems provide generic services to their

More information

ISSN: [Keswani* et al., 7(1): January, 2018] Impact Factor: 4.116

ISSN: [Keswani* et al., 7(1): January, 2018] Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AUTOMATIC TEST CASE GENERATION FOR PERFORMANCE ENHANCEMENT OF SOFTWARE THROUGH GENETIC ALGORITHM AND RANDOM TESTING Bright Keswani,

More information

Overview of ITU capacity building activities

Overview of ITU capacity building activities Europe Centers of Excellence Steering Committee Meeting Copenhagen, Denmark 12 December 2017 Overview of ITU capacity building activities Mike Nxele Senior Human Capacity Building Officer, Human Capacity

More information

Web Security Vulnerabilities: Challenges and Solutions

Web Security Vulnerabilities: Challenges and Solutions Web Security Vulnerabilities: Challenges and Solutions A Tutorial Proposal for ACM SAC 2018 by Dr. Hossain Shahriar Department of Information Technology Kennesaw State University Kennesaw, GA 30144, USA

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

CONSTRUCTION AND EVALUATION OF MESHES BASED ON SHORTEST PATH TREE VS. STEINER TREE FOR MULTICAST ROUTING IN MOBILE AD HOC NETWORKS

CONSTRUCTION AND EVALUATION OF MESHES BASED ON SHORTEST PATH TREE VS. STEINER TREE FOR MULTICAST ROUTING IN MOBILE AD HOC NETWORKS CONSTRUCTION AND EVALUATION OF MESHES BASED ON SHORTEST PATH TREE VS. STEINER TREE FOR MULTICAST ROUTING IN MOBILE AD HOC NETWORKS 1 JAMES SIMS, 2 NATARAJAN MEGHANATHAN 1 Undergrad Student, Department

More information

Single-pass Static Semantic Check for Efficient Translation in YAPL

Single-pass Static Semantic Check for Efficient Translation in YAPL Single-pass Static Semantic Check for Efficient Translation in YAPL Zafiris Karaiskos, Panajotis Katsaros and Constantine Lazos Department of Informatics, Aristotle University Thessaloniki, 54124, Greece

More information

Supervised classification of law area in the legal domain

Supervised classification of law area in the legal domain AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms

More information

GENETIC ALGORITHM AND BAYESIAN ATTACK GRAPH FOR SECURITY RISK ANALYSIS AND MITIGATION P.PRAKASH 1 M.

GENETIC ALGORITHM AND BAYESIAN ATTACK GRAPH FOR SECURITY RISK ANALYSIS AND MITIGATION P.PRAKASH 1 M. GENETIC ALGORITHM AND BAYESIAN ATTACK GRAPH FOR SECURITY RISK ANALYSIS AND MITIGATION P.PRAKASH 1 M.SIVAKUMAR 2 1 Assistant Professor/ Dept. of CSE, Vidyaa Vikas College of Engineering and Technology,

More information

Visualizing Access Control Policies in Databases. Jared Chandler. Comp 116 Fall 2015

Visualizing Access Control Policies in Databases. Jared Chandler. Comp 116 Fall 2015 Visualizing Access Control Policies in Databases Jared Chandler Comp 116 Fall 2015 To the Community I chose this topic because it is something I have an interest in improving. SQL databases are widespread

More information

STATISTICS (STAT) Statistics (STAT) 1

STATISTICS (STAT) Statistics (STAT) 1 Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).

More information

Sathyamangalam, 2 ( PG Scholar,Department of Computer Science and Engineering,Bannari Amman Institute of Technology, Sathyamangalam,

Sathyamangalam, 2 ( PG Scholar,Department of Computer Science and Engineering,Bannari Amman Institute of Technology, Sathyamangalam, IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 8, Issue 5 (Jan. - Feb. 2013), PP 70-74 Performance Analysis Of Web Page Prediction With Markov Model, Association

More information

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 Improvisation of Global Pairwise Sequence Alignment

More information

Clustering Based Certificate Revocation Scheme for Malicious Nodes in MANET

Clustering Based Certificate Revocation Scheme for Malicious Nodes in MANET International Journal of Scientific and Research Publications, Volume 3, Issue 5, May 2013 1 Clustering Based Certificate Revocation Scheme for Malicious Nodes in MANET Ms.T.R.Panke * M.B.E.S.College of

More information

Statistical Testing of Software Based on a Usage Model

Statistical Testing of Software Based on a Usage Model SOFTWARE PRACTICE AND EXPERIENCE, VOL. 25(1), 97 108 (JANUARY 1995) Statistical Testing of Software Based on a Usage Model gwendolyn h. walton, j. h. poore and carmen j. trammell Department of Computer

More information

Impact of Dependency Graph in Software Testing

Impact of Dependency Graph in Software Testing Impact of Dependency Graph in Software Testing Pardeep Kaur 1, Er. Rupinder Singh 2 1 Computer Science Department, Chandigarh University, Gharuan, Punjab 2 Assistant Professor, Computer Science Department,

More information

IJRIM Volume 1, Issue 4 (August, 2011) (ISSN ) A SURVEY ON BEHAVIOUR OF BLACKHOLE IN MANETS ABSTRACT

IJRIM Volume 1, Issue 4 (August, 2011) (ISSN ) A SURVEY ON BEHAVIOUR OF BLACKHOLE IN MANETS ABSTRACT A SURVEY ON BEHAVIOUR OF BLACKHOLE IN MANETS Pinki Tanwar * Shweta** ABSTRACT A mobile adhoc network is a collection of mobile nodes which form a network which is not fixed. The nodes in the network dynamically

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

A PERSONALIZED RECOMMENDER SYSTEM FOR TELECOM PRODUCTS AND SERVICES

A PERSONALIZED RECOMMENDER SYSTEM FOR TELECOM PRODUCTS AND SERVICES A PERSONALIZED RECOMMENDER SYSTEM FOR TELECOM PRODUCTS AND SERVICES Zui Zhang, Kun Liu, William Wang, Tai Zhang and Jie Lu Decision Systems & e-service Intelligence Lab, Centre for Quantum Computation

More information

Message Transmission with User Grouping for Improving Transmission Efficiency and Reliability in Mobile Social Networks

Message Transmission with User Grouping for Improving Transmission Efficiency and Reliability in Mobile Social Networks , March 12-14, 2014, Hong Kong Message Transmission with User Grouping for Improving Transmission Efficiency and Reliability in Mobile Social Networks Takuro Yamamoto, Takuji Tachibana, Abstract Recently,

More information

Multilabel Classification Evaluation using Ontology Information

Multilabel Classification Evaluation using Ontology Information Multilabel Classification Evaluation using Ontology Information Stefanie Nowak, Hanna Lukashevich Fraunhofer Institute for Digital Media Technology IDMT Ehrenbergstrasse 31, 98693 Ilmenau, Germany, {nwk,lkh}@idmt.fraunhofer.de

More information

A Genetic-Neural Approach for Mobility Assisted Routing in a Mobile Encounter Network

A Genetic-Neural Approach for Mobility Assisted Routing in a Mobile Encounter Network A Genetic-Neural Approach for obility Assisted Routing in a obile Encounter Network Niko P. Kotilainen, Jani Kurhinen Abstract--obility assisted routing (AR) is a concept, where the mobility of a network

More information

Middle in Forwarding Movement (MFM): An efficient greedy forwarding approach in location aided routing for MANET

Middle in Forwarding Movement (MFM): An efficient greedy forwarding approach in location aided routing for MANET Middle in Forwarding Movement (MFM): An efficient greedy forwarding approach in location aided routing for MANET 1 Prashant Dixit* Department of CSE FET, Manavrachna international institute of research

More information

TOWARD PRIVACY PRESERVING AND COLLUSION RESISTANCE IN A LOCATION PROOF UPDATING SYSTEM

TOWARD PRIVACY PRESERVING AND COLLUSION RESISTANCE IN A LOCATION PROOF UPDATING SYSTEM TOWARD PRIVACY PRESERVING AND COLLUSION RESISTANCE IN A LOCATION PROOF UPDATING SYSTEM R.Bhuvaneswari 1, V.Vijayalakshmi 2 1 M.Phil., Scholar, Bharathiyar Arts And Science College For Women, India 2 HOD

More information

Implementing Sequential Consistency In Cache-Based Systems

Implementing Sequential Consistency In Cache-Based Systems To appear in the Proceedings of the 1990 International Conference on Parallel Processing Implementing Sequential Consistency In Cache-Based Systems Sarita V. Adve Mark D. Hill Computer Sciences Department

More information

2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media,

2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising

More information

Introduction and Statement of the Problem

Introduction and Statement of the Problem Chapter 1 Introduction and Statement of the Problem 1.1 Introduction Unlike conventional cellular wireless mobile networks that rely on centralized infrastructure to support mobility. An Adhoc network

More information

Minsoo Ryu. College of Information and Communications Hanyang University.

Minsoo Ryu. College of Information and Communications Hanyang University. Software Reuse and Component-Based Software Engineering Minsoo Ryu College of Information and Communications Hanyang University msryu@hanyang.ac.kr Software Reuse Contents Components CBSE (Component-Based

More information

P2P Contents Distribution System with Routing and Trust Management

P2P Contents Distribution System with Routing and Trust Management The Sixth International Symposium on Operations Research and Its Applications (ISORA 06) Xinjiang, China, August 8 12, 2006 Copyright 2006 ORSC & APORC pp. 319 326 P2P Contents Distribution System with

More information

Classification with Diffuse or Incomplete Information

Classification with Diffuse or Incomplete Information Classification with Diffuse or Incomplete Information AMAURY CABALLERO, KANG YEN Florida International University Abstract. In many different fields like finance, business, pattern recognition, communication

More information

Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm

Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm Saiyad Sharik Kaji Prof.M.B.Chandak WCOEM, Nagpur RBCOE. Nagpur Department of Computer Science, Nagpur University, Nagpur-441111

More information

Trust4All: a Trustworthy Middleware Platform for Component Software

Trust4All: a Trustworthy Middleware Platform for Component Software Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, August 24-26, 2007 124 Trust4All: a Trustworthy Middleware Platform for Component Software

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

CS415 Compilers Overview of the Course. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

CS415 Compilers Overview of the Course. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University CS415 Compilers Overview of the Course These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University Critical Facts Welcome to CS415 Compilers Topics in the

More information

IEEE : Standard for Optimized Radio Resource Usage in Composite Wireless Networks

IEEE : Standard for Optimized Radio Resource Usage in Composite Wireless Networks IEEE 1900.4: Standard for Optimized Radio Resource Usage in Composite Wireless Networks Babak Siabi Isfahan University of Technology b.siabi@ec.iut.ac.ir Abstract Newly published IEEE 1900.4 standard is

More information

Multimodal Information Spaces for Content-based Image Retrieval

Multimodal Information Spaces for Content-based Image Retrieval Research Proposal Multimodal Information Spaces for Content-based Image Retrieval Abstract Currently, image retrieval by content is a research problem of great interest in academia and the industry, due

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

Programming Languages Third Edition

Programming Languages Third Edition Programming Languages Third Edition Chapter 12 Formal Semantics Objectives Become familiar with a sample small language for the purpose of semantic specification Understand operational semantics Understand

More information

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 Overview The goals of analyzing cross-sectional data Standard methods used

More information

On The Theoretical Foundation for Data Flow Analysis in Workflow Management

On The Theoretical Foundation for Data Flow Analysis in Workflow Management Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2005 Proceedings Americas Conference on Information Systems (AMCIS) 2005 On The Theoretical Foundation for Data Flow Analysis in

More information

Taxonomy Dimensions of Complexity Metrics

Taxonomy Dimensions of Complexity Metrics 96 Int'l Conf. Software Eng. Research and Practice SERP'15 Taxonomy Dimensions of Complexity Metrics Bouchaib Falah 1, Kenneth Magel 2 1 Al Akhawayn University, Ifrane, Morocco, 2 North Dakota State University,

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA A taxonomy of race conditions. D. P. Helmbold, C. E. McDowell UCSC-CRL-94-34 September 28, 1994 Board of Studies in Computer and Information Sciences University of California, Santa Cruz Santa Cruz, CA

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Tree-Based Minimization of TCAM Entries for Packet Classification

Tree-Based Minimization of TCAM Entries for Packet Classification Tree-Based Minimization of TCAM Entries for Packet Classification YanSunandMinSikKim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington 99164-2752, U.S.A.

More information

Vulnerability Discovery in Multi-Version Software Systems

Vulnerability Discovery in Multi-Version Software Systems Vulnerability Discovery in Multi-Version Software Systems Jinyoo Kim, Yashwant K. Malaiya, Indrakshi Ray Computer Science Department Colorado State University, Fort Collins, CO 8052 [jyk6457, malaiya,

More information

Applying March Tests to K-Way Set-Associative Cache Memories

Applying March Tests to K-Way Set-Associative Cache Memories 13th European Test Symposium Applying March Tests to K-Way Set-Associative Cache Memories Simone Alpe, Stefano Di Carlo, Paolo Prinetto, Alessandro Savino Politecnico di Torino, Dep. of Control and Computer

More information

Analyzing a Human-based Trust Model for Mobile Ad Hoc Networks

Analyzing a Human-based Trust Model for Mobile Ad Hoc Networks Analyzing a Human-based Trust Model for Mobile Ad Hoc Networks Pedro B. Velloso 1, Rafael P. Laufer 2, Otto Carlos M. B. Duarte 3, and Guy Pujolle 1 1 Laboratoire d Informatique de Paris 6 (LIP6) 2 Computer

More information

Characterization and Modeling of Deleted Questions on Stack Overflow

Characterization and Modeling of Deleted Questions on Stack Overflow Characterization and Modeling of Deleted Questions on Stack Overflow Denzil Correa, Ashish Sureka http://correa.in/ February 16, 2014 Denzil Correa, Ashish Sureka (http://correa.in/) ACM WWW-2014 February

More information

Incompatibility Dimensions and Integration of Atomic Commit Protocols

Incompatibility Dimensions and Integration of Atomic Commit Protocols The International Arab Journal of Information Technology, Vol. 5, No. 4, October 2008 381 Incompatibility Dimensions and Integration of Atomic Commit Protocols Yousef Al-Houmaily Department of Computer

More information

A CORDIC Algorithm with Improved Rotation Strategy for Embedded Applications

A CORDIC Algorithm with Improved Rotation Strategy for Embedded Applications A CORDIC Algorithm with Improved Rotation Strategy for Embedded Applications Kui-Ting Chen Research Center of Information, Production and Systems, Waseda University, Fukuoka, Japan Email: nore@aoni.waseda.jp

More information

IJESRT. [Dahiya, 2(5): May, 2013] ISSN: Keywords: AODV, DSDV, Wireless network, NS-2.

IJESRT. [Dahiya, 2(5): May, 2013] ISSN: Keywords: AODV, DSDV, Wireless network, NS-2. IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY Performance Comparison of ADSDV and DSDV in MANET Brahm Prakash Dahiya Shaym Lal College,New Delhi, India brahmprakasd@gmail.com

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Relating Software Coupling Attribute and Security Vulnerability Attribute

Relating Software Coupling Attribute and Security Vulnerability Attribute Relating Software Coupling Attribute and Security Vulnerability Attribute Varadachari S. Ayanam, Frank Tsui, Sheryl Duggins, Andy Wang Southern Polytechnic State University Marietta, Georgia 30060 Abstract:

More information

How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments?

How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments? How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments? A. Hossein Farajpahlou Professor, Dept. Lib. and Info. Sci., Shahid Chamran

More information

Enhancing K-means Clustering Algorithm with Improved Initial Center

Enhancing K-means Clustering Algorithm with Improved Initial Center Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Abstract Mrs. C. Poongodi 1, Ms. R. Kalaivani 2 1 PG Student, 2 Assistant Professor, Department of

More information

Implementation of an Algorithmic To Improve MCDS Based Routing In Mobile Ad-Hoc Network By Using Articulation Point

Implementation of an Algorithmic To Improve MCDS Based Routing In Mobile Ad-Hoc Network By Using Articulation Point International Journal of Computational Engineering Research Vol, 03 Issue5 Implementation of an Algorithmic To Improve MCDS Based Routing In Mobile Ad-Hoc Network By Using Articulation Point Shalu Singh

More information

Properties of Biological Networks

Properties of Biological Networks Properties of Biological Networks presented by: Ola Hamud June 12, 2013 Supervisor: Prof. Ron Pinter Based on: NETWORK BIOLOGY: UNDERSTANDING THE CELL S FUNCTIONAL ORGANIZATION By Albert-László Barabási

More information

International Journal of Scientific & Engineering Research Volume 8, Issue 5, May ISSN

International Journal of Scientific & Engineering Research Volume 8, Issue 5, May ISSN International Journal of Scientific & Engineering Research Volume 8, Issue 5, May-2017 106 Self-organizing behavior of Wireless Ad Hoc Networks T. Raghu Trivedi, S. Giri Nath Abstract Self-organization

More information

A HYBRID METHOD FOR SIMULATION FACTOR SCREENING. Hua Shen Hong Wan

A HYBRID METHOD FOR SIMULATION FACTOR SCREENING. Hua Shen Hong Wan Proceedings of the 2006 Winter Simulation Conference L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, eds. A HYBRID METHOD FOR SIMULATION FACTOR SCREENING Hua Shen Hong

More information

Load Balancing Algorithm over a Distributed Cloud Network

Load Balancing Algorithm over a Distributed Cloud Network Load Balancing Algorithm over a Distributed Cloud Network Priyank Singhal Student, Computer Department Sumiran Shah Student, Computer Department Pranit Kalantri Student, Electronics Department Abstract

More information