Specification-Based Unit Testing

Specification-Based Unit Testing

Arcadio Rubio García

Kongens Lyngby 2009
IMM-MSC

Technical University of Denmark
Informatics and Mathematical Modelling
Building 321, DK-2800 Kongens Lyngby, Denmark
Phone , Fax
IMM-PHD: ISSN

Summary

Automated generation of tests from specifications has recently been introduced to address some of the shortcomings of classical automated unit testing. Test cases are created from properties, given by the programmer, that a piece of code must satisfy. This clearly leads to much more concise test code. However, the approach still has a long journey ahead before it is ready to replace classical unit testing in most of its applications.

Throughout this MSc thesis we try to contribute towards the solution of the two main problems it faces. First, test cases are currently generated from properties in a purely random fashion. We explore alternatives for creating them in a more intelligent, yet still very efficient, manner, so that the probability of finding bugs becomes higher in practice. Second, properties are naturally suited to specifying functional code, but new constructs are needed to apply them to stateful code, such as object-oriented software. We attempt to offer answers in this direction too.

A prototype demonstrating our ideas for the resolution of both problems has been built as part of the thesis. A snapshot of the source code at the time of writing can be found at the following URL; further updates, if any, will also be made available on this site:


Preface

This thesis was prepared at the Department of Informatics and Mathematical Modelling in partial fulfillment of the requirements for acquiring an MSc in Computer Science and Engineering. It was written during the period 27th October 2008 to 14th April 2009 under the supervision of Associate Professor Hubert Baumeister, and amounted to 30 ECTS points.

Lyngby, April 2009

A. Rubio


Acknowledgements

I would like to thank my supervisor Hubert Baumeister, who has been very helpful with guidance, ideas and critical comments during the elaboration of this thesis.


Contents

Summary
Preface
Acknowledgements

1 Introduction
  1.1 Goals
  1.2 Outline

2 Background
  2.1 Basic Concepts
  2.2 Testing Process
  2.3 Software Testing Methods
  2.4 Test Case Generation Strategies

3 State of the Art
  3.1 Classical Unit Testing Frameworks
  3.2 Behaviour-Driven Unit Testing Frameworks
  3.3 Specification-Based Unit Testing Frameworks

4 A Specification-Based Unit Testing Framework
  4.1 Design Principles
  4.2 Properties
  4.3 Testing Practices

5 Unit Testing Strategies
  5.1 Test Case Generation
  5.2 Independent Progress Measurement

6 Testing Object-Oriented Systems
  6.1 Complexity of Object-Oriented Systems
  6.2 Contracts as Properties

7 Architecture
  7.1 Implementation Language
  7.2 System Architecture

8 Design and Implementation
  8.1 Domain-Specific Language
  8.2 Properties and Contracts
  8.3 Runner
  8.4 Test Case Generation Strategies

9 Conclusions
  9.1 Final Remarks
  9.2 Summary of Contributions
  9.3 Further Work

Chapter 1

Introduction

We do not have a science of software engineering. Until we do, we should take advantage of late binding.
Alan Kay

This master's thesis deals with a central problem in software engineering: testing. More concretely, it is concerned with testing in its most popular form, the one closest to daily implementation work: unit testing, which involves checking the smallest components or units of a system in isolation.

As pointed out by Alan Kay, software engineering is still at a primitive stage, where formal methods seem to offer too little benefit in comparison with the costs involved in their usage. This is especially true in typical business environments, where requirements are often inconsistent and volatile. Hence, industry has turned to techniques that provide greater design flexibility at the expense of safety. This is the case, for instance, of dynamic dispatch, which is the basis of object-oriented systems.

The situation is quite apparent and does not show signs of changing soon. Dynamically typed languages account for up to 40% of total usage, a share that has almost doubled during this decade according to a popular index [4].

A large share of programmers have even abandoned the safety of statically typed systems in the quest for further flexibility. This may sound paradoxical, since static type systems have been characterized as the one lightweight formal method that is guaranteed to be used [59].

In the current scenario, testing is the only practical way to ensure the quality of a large proportion of the software being developed. What is more, unit testing represents the cornerstone of agile methodologies, meaning that it is used not only for quality assurance, but also for documenting and even for specifying the design [12], in an otherwise chaotic development process.

Therefore, it seems crucial to have good unit testing tools, yet mainstream ones are quite primitive in our opinion. Let us briefly explain why. Testing is often modelled as a four-step process [70]:

1. Modelling the software's environment
2. Selecting test scenarios
3. Running and evaluating scenarios
4. Measuring test progress

Standard unit testing frameworks, such as those belonging to the XUnit family [1], merely automate the execution of test cases. The vast majority of the task is left to the programmer, who still has to choose adequate cases, specifying both the input and the expected result, as well as decide whether unit testing has reached its goals. Thus, software developers are only relieved from the third step, which is also the simplest one.

Widely used frameworks do not go beyond that. They provide plenty of assertion constructs, sometimes including mock and stub objects for better testing objects in isolation. However, this should not obscure the fact that they are mere execution tools that run, in a single step, code written with all the facilities they deliver.

1.1 Goals

The root of the problem lies in the lack of an explicit mechanism, an abstraction, for modelling the software's environment. This makes it impossible to automate the remaining steps of the process. Instead, programmers have to choose test cases and manually write code to exercise them. Typical frameworks only assist developers by easing the tasks they need to perform manually; they do not make these tasks unnecessary, as already explained.

Furthermore, operating at a lower level, without an abstraction for modelling the environment, means that the documenting value of the tests is reduced: they only provide an implicit and informal definition of the software's behaviour, scattered across different places. One has to go through different test cases to collect the parts that compose a specification. Hence, there is an evident risk of ending up with an inconsistent or incomplete one, not to mention the burden involved in making any changes to it.

Recently, automated generation of test cases from specifications has been applied to unit testing in order to overcome the limitations of classical frameworks just mentioned [24]. It should be noted that automated generation of test data by itself is an old concept, documented in the literature at least as early as the 1970s [25, 52]. The key point in this idea is that test cases are generated with the aid of specifications. These are executable expressions, written by the programmer, that receive input values and output a Boolean indicating whether the requirements are met for each test case provided. Thus, the task can be automated, as it merely consists of choosing inputs for the property and checking whether the outcome is true or not. Although an explicit and formal specification is present, the method employed is still testing and not formal verification. The specification is used for generating test cases with the aim of showing the presence of bugs, not for mathematically proving their absence.

Current specification-based frameworks are still immature due to their relative infancy. On the one hand, they generate test cases in a random fashion, which can be too restrictive a strategy for revealing the myriad of different types of bugs that can potentially occur in today's software. On the other hand, as they originated in functional programming environments, they do not offer facilities for testing object-oriented systems. We will try to make advances in both directions, providing:

- Better test case generation strategies for tackling different types of bugs
- Support for specifying object-oriented systems

1.2 Outline

First, in chapter 2, we cover the essentials of testing necessary to elaborate the rest of the report. These include an idealized testing process, and the fundamental testing methods and techniques.

Afterward, in chapter 3, we survey the existing unit testing frameworks, presenting their main technical details, virtues and limitations. The study is divided into the analysis of the popular XUnit family plus some recent derivatives, and the still research-oriented QuickCheck framework.

Next, in chapter 4, our views on the state of the art are exposed along with our hypotheses, establishing what we think should be the essentials of a specification-based framework that accomplishes the goals listed in the previous section.

Then, in chapter 5, we describe the different strategies employed by the framework for generating test cases automatically. We also propose a technique for measuring testing progress in a strategy-independent fashion. These two elements help us fulfill our first goal.

Subsequently, in chapter 6, we show how to extend specification-based frameworks to support object-oriented systems, our second goal. We unify the concepts of property and contract, as in design by contract, so that we can use the same construct for checking programs both at development time and at runtime.

Thereafter, in chapter 7, we describe the main architectural aspects of the framework. These comprise how the system is structured into components, as well as the selection of a particular implementation language.

Before concluding, in chapter 8, the internal design of the different components of the framework is explained. In particular, we cover all the aspects of the domain-specific language used for specifying properties.

Finally, in chapter 9, we review the most important ideas concluded throughout the thesis and sum up our own contributions to the subject. We also offer some directions for further work on the topic.

Chapter 2

Background

Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence.
Edsger W. Dijkstra, The Humble Programmer

With the intention of setting the scene, and at the same time making the report self-contained, in this chapter we review the essentials of software testing.

First, in section 2.1, we outline the basic concepts related to software testing. Among those, the idea that tests are only useful for disproving the correctness of software is perhaps the most important one exposed in the chapter.

Next, in section 2.2, we present a general model of the testing process applicable to most testing activities. A correct understanding of it is essential for later recognizing the shortcomings of most unit testing frameworks.

Finally, in sections 2.3 and 2.4, we sketch the main software testing methods and test case generation techniques, both human and automated.

2.1 Basic Concepts

We begin by presenting one of the multiple existing definitions of software testing:

Definition 2.1 (Software testing) Software testing is an empirical investigation conducted to provide stakeholders with information about the quality of the product or service under test [39].

We prefer this definition to others as it clearly highlights the fact that software testing is not limited to finding defects in the logic of programs, or bugs that cause them to malfunction. Instead, it is a much broader activity. It performs two major functions in relation to the quality of software. On the one hand, it checks whether the correct product is being built, also known as validation. On the other hand, it checks whether the product is being built correctly, or verification [18]. Both are the main elements of every quality management system, such as ISO 9000, and are not unique to software [65]. Aiming at low-level bugs, as we deal with unit testing, our work is concerned with verification.

The two key terms in the definition of testing are empirical and quality. We have just discussed quality. Let us concentrate on explaining the implications of testing being a practical process, or empirical investigation, as said in the definition. Testing contrasts with formal methods, which try to prove mathematically the absence of software faults. Being an empirical investigation, for testing that would imply trying all possible inputs to a program, which becomes impossible even for almost trivial ones. A small example of 59 LOC shown in [36], which we will not reproduce here for the sake of brevity, is a good demonstration of this fact. A few loops and conditionals combined produce a combinatorial explosion of the inputs needed for covering all cases. As a result, an exhaustive test would require 67 different test conditions, 368 logical paths and data values.

Remark 2.2 Exhaustive testing is unfeasible, even for simple programs.

Hence, testing can only expect to show the presence of bugs, as stated in the epigraph at the beginning of the chapter. This does not mean that testing is an approach that should be discarded in favour of formal methods. In fact, these are currently too expensive and do not provide enough benefits to replace testing in the majority of projects.

Remark 2.3 The only purpose of testing is to find errors.

The above principle is sometimes reformulated as follows:

Remark 2.4 A successful test is a test that fails [63].

The contrast between formal methods and testing is in many cases not well understood. Whereas the former use a mathematical approach, the latter sometimes adopts ad hoc and not very rigorous methods based on recipes for each scenario, making it something close to a craft. We strongly believe testing should adopt a statistical approach, especially in the case of black-box testing, which we will define later.

We can illustrate this by imagining testing as a game between the programmer and the tester [8]. Let us assume the programmer writes perfect software and sends it for testing. The tester calls the functions written by the programmer, but on the way back the return values are sometimes altered by an evil daemon, as depicted in figure 2.1. Now, if the daemon introduces the bugs in a completely irrational way (e.g., it introduces a bug every 1357 calls to the program), it is easy to see how hard this makes the testing task: it will be almost impossible to find bugs. However, if faults follow a random distribution, the tester has more chances to spot some. For instance, extreme values are known to be more risky [16]. From our point of view, testing should try to study the random distribution of bugs and elaborate strategies to generate test cases that are likely to uncover software faults. In fact, this is the main idea behind this thesis.

Remark 2.5 Test cases should be chosen according to their probability of success [8].

Of course it is very difficult to establish concrete probabilities for test cases in a particular project, but surveys in the field like [7] analyze historical data, proving or disproving facts about the distribution of errors, which are very valuable for creating test case generation strategies.

Figure 2.1: Daemon introducing errors (the tester calls the program and receives return values that the daemon occasionally alters)

2.2 Testing Process

Regardless of the kind of testing activity and methods employed, most can be modelled as a four-step process [70], which is worth discussing before going any further.

First, the tester has to describe the interaction between the part under test and its environment. Depending on the level of testing, this could be a procedure or an object interacting with others, a module, or a system as a whole communicating with the end user. This is the most fundamental step and involves creating a sound specification of the interface between both parts, which can be a programming interface, a communications interface or a human interface, among others.

Second, test scenarios or cases have to be chosen taking into account the environment model previously elaborated. Since exhaustive testing is impossible, the finite set of cases selected has to represent economically the infinite set of inputs to the system. This is sometimes referred to by saying that it fulfills the test adequacy criteria.

Third, test cases are run. This does not necessarily imply that cases are executable code. Again, testing is a broader activity: running could mean performing usability checks on a user interface sketch drawn on paper. After running the cases, the outcome is evaluated.

Finally, the status of testing is assessed to determine whether enough of it has been done so far. Essentially this involves examining the cases run and deciding whether they capture the set of inputs or not.

Figure 2.2: Testing process (model the software's environment, select test cases, run and evaluate cases, measure test progress)

2.3 Software Testing Methods

From now onwards, we focus solely on verification methods for code, which are a superset of the area of interest to the thesis: black-box unit testing. Next, we provide a few examples of different software testing methods, which depend on the level at which we desire to evaluate our software.

The most basic method is unit testing. Along with the coding activities, the most basic components or units of the code are checked. Those units depend on the particular programming paradigm, and may be functions, classes or objects. Unit testing may take place immediately after building those units, or even before if we adopt an agile approach [13].

After software modules are tested in isolation, their interaction has to be checked. This may be done in several ways, such as top down or bottom up [63]. Integration testing methods have less relevance in methodologies that advocate continuous integration.

Another important approach is to look for regressions, which are bugs introduced into features that worked correctly in the past due to the side effects of some modification. Regression testing tries to uncover those. It is particularly problematic in the maintenance phase, where the software may have become big enough that it is no longer cost-effective to re-run all test cases after each change performed in the code.

2.4 Test Case Generation Strategies

Different test case generation techniques can be classified according to the perspective from which software was viewed when they were designed. Black-box testing involves examining software from a functional approach, whereas white-box testing implies observing the internal structure of the software for generating cases.

Human test case generation

There are two main black-box test generation techniques employed by humans. First, equivalence partitioning, where the input domain of a program is divided into equivalence classes. The idea is to reduce the number of test cases necessary by considering it sufficient to test one element from each class [16]. Second, boundary value analysis selects extreme input values, but also generates cases such that extreme outputs are produced [16]. It works under the assumption that errors are more frequent at extremes and contiguous values.
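As a brief illustration (ours; the grade function and its specification are hypothetical, not taken from the literature cited above), the following sketch shows the kind of cases a human tester would derive from both black-box techniques for a function mapping a score in 0..100 to a pass or fail grade.

    # Hypothetical unit under test: maps a score in 0..100 to a pass/fail grade.
    def grade(score)
      raise ArgumentError, "score out of range" unless (0..100).include?(score)
      score >= 50 ? :pass : :fail
    end

    # Equivalence partitioning: one representative per class of the input domain.
    equivalence_cases = {
      -10 => ArgumentError,   # invalid: below the range
      30  => :fail,           # valid: failing scores
      75  => :pass,           # valid: passing scores
      200 => ArgumentError    # invalid: above the range
    }

    # Boundary value analysis: the extremes of each class and their neighbours.
    boundary_cases = {
      -1 => ArgumentError, 0 => :fail, 49 => :fail,
      50 => :pass, 100 => :pass, 101 => ArgumentError
    }

    equivalence_cases.merge(boundary_cases).each do |input, expected|
      actual = begin
        grade(input)
      rescue ArgumentError => e
        e.class
      end
      puts "grade(#{input}): expected #{expected}, got #{actual}"
    end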

White-box test generation techniques, on the other hand, are numerous and depend on the class of software being developed. Two relatively well-known ones are condition testing and loop testing. The first one tries to generate test cases such that all expressions in the decisions of a program (conditionals, switches, etc.) are evaluated both to true and false, as well as the subexpressions inside conditions. The second one classifies loops into four main categories [63] and defines cases that must be run in order to ensure that they work adequately in typical scenarios.

Automated test case generation

Automated generation of test cases has been an active research topic of software engineering at least since the 1970s [25, 52]. Testing is a costly development activity and therefore there is a strong desire to automate it as much as possible. As a matter of fact, it may account for up to 50% of the total cost in some projects [63]. The different approaches that have been proposed throughout all this time can be classified into three main paradigms [6]:

Early efforts [25] went in the direction of using symbolic values for variables in order to get information about the code by running the program with them, instead of using real values. This is generally referred to as symbolic test generation. The information captured is later used for creating algebraic constraints which, in turn, can be used for deriving test cases. More recent approaches which use program analysis techniques [32], and in particular abstract interpretation [27], can be seen as a refinement of this technique.

A simpler approach is to generate test data randomly until the goals are achieved; that is, until the set of test cases created is thought to represent adequately all the possible inputs to the program. Random test case generation has been employed in the past for generating cases for white-box testing [17], but recent specification-based frameworks [24] also use this approach for black-box unit testing. An important detail is that the random distribution of test cases should follow that of actual data in order to be successful [34].

In dynamic test case generation, programs are instrumented and test cases are run. The resulting information is used for refining the provided cases, if necessary, to try to achieve the test adequacy goal, which could be, for example, branch coverage. Usually a heuristic function is used to guide the refinement process. This function measures the distance between the test case and the adequacy criteria. Hence, the test case generation process is translated into a minimization problem. Recent approaches include using evolutionary techniques, such as genetic algorithms, since they are especially well suited for minimizing the non-linear functions that often appear when using this kind of method [51].
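To illustrate how dynamic test case generation turns coverage into a minimization problem, the following sketch (ours; both the unit under test and the branch-distance function are contrived) hill-climbs an input towards a branch that uniform random inputs would be very unlikely to reach.

    # Unit under test (hypothetical): the interesting branch is taken only
    # when x is exactly 4242, which random inputs are unlikely to hit.
    def under_test(x)
      x == 4242 ? :rare_branch : :common_branch
    end

    # Branch-distance heuristic: how far the input is from satisfying the
    # condition guarding the target branch. Zero means the branch is covered.
    def fitness(x)
      (x - 4242).abs
    end

    # Naive hill climbing: start from a random case and keep the neighbour
    # that reduces the distance, i.e. treat coverage as a minimization problem.
    x = rand(-100_000..100_000)
    until fitness(x) == 0
      neighbours = [x - 1, x + 1, x - 100, x + 100]
      best = neighbours.min_by { |n| fitness(n) }
      break if fitness(best) >= fitness(x)   # stuck in a local minimum
      x = best
    end

    puts "Generated test case: #{x} (#{under_test(x)})"

A genetic algorithm replaces this single-candidate search with a population of test cases that is recombined and mutated, but the fitness-driven idea is the same.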

Chapter 3

State of the Art

There is no single development, in either technology or management technique, that by itself promises even one order-of-magnitude improvement in productivity, in reliability, in simplicity.
Frederick P. Brooks, Jr., No Silver Bullet: Essence and Accidents of Software Engineering

In this chapter we describe the current state of the art in the unit testing framework panorama. The idea is to give the reader a quick overview, presenting the main tools and outlining their virtues and deficiencies, to later justify the principles behind our proposal.

We begin our discussion in section 3.1, covering the so-called classical unit testing frameworks, to later turn to more recent approaches in sections 3.2 and 3.3. An important idea is that, although more advanced approaches offer progressively better solutions, testing is a hard problem due to the essential properties of software [20]. Therefore, from a quality enhancement point of view, improvements are relatively modest from one tool to another.

3.1 Classical Unit Testing Frameworks

Classical unit testing frameworks are well represented by the XUnit family, SUnit being the precursor [11] and JUnit its most popular member [1]. The convention is that the prefix represents the implementation language of each particular tool. So JUnit stands for Java XUnit, whereas SUnit is the Smalltalk implementation.

This category of frameworks achieves a simple goal, automating the execution of unit tests, so that the developer is relieved from the burden of having to run test cases manually and check their outcome each time. Therefore, it is not hard to imagine that this idea had been applied earlier than the mid 1990s, when SUnit was born, although it might not have circulated publicly. For instance, Taligent developed a strikingly similar framework written in C++ around 1991 [67].

An important fact about the XUnit family is that its representatives are implemented closely following the idioms of the language they are aimed at. Hence, some design details differ a bit from one framework to another. Our discussion here is of the high-level concepts, so this is not that important. For the record, the examples have been adapted from the JUnit Cookbook [1].

There are several key concepts behind classical unit testing frameworks. First, the idea of a fixture is very important. Fixtures represent the code that needs to be run before and after each test case is executed in order to set up the environment. Since the XUnit family is mainly directed towards object-oriented languages, which are predominant in industry, this is not surprising. Objects are inherently stateful, so different test cases are likely to share some code for configuring them and their collaborators before doing the real work. For example, the following code contains a fixture in the setUp method, creating different Money instances with different currencies and amounts. The method is run before each test case is executed:

    public class MoneyTest {
        private Money f12CHF;
        private Money f14CHF;
        private Money f28USD;

        @Before
        public void setUp() {
            f12CHF = new Money(12, "CHF");
            f14CHF = new Money(14, "CHF");
            f28USD = new Money(28, "USD");
        }

Another two basic concepts of classical frameworks are those of test case and assertion. Test cases contain the code running the scenario selected by the developer, once the environment has been set up by the fixtures. Inside test cases, assertions verify that the outcome of the test case is what the developer expected when he designed the case. The following code checks that adding two of the Money instances created by the fixture produces the correct result:

    @Test
    public void simpleAdd() {
        Money expected = new Money(26, "CHF");
        Money result = f12CHF.add(f14CHF);
        assertTrue(expected.equals(result));
    }

The remaining concepts are test suite and runner. We do not illustrate suites here. They are simply meant for grouping collections of related test cases and other suites (note the recursion), thus establishing a tree hierarchy. This makes it possible to run the adequate subset of cases with little effort when certain changes are made in the code. Runners are the elements responsible for executing the fixtures and cases. Different runners provide different features, such as complex GUIs or concurrency. The following code runs the previous test case using a plain runner:

    org.junit.runner.JUnitCore.runClasses(MoneyTest.class);

We have just seen that, as outlined in the introduction, classical unit testing frameworks limit themselves to automating the third step of the testing process: running test cases and evaluating the results. However, from the code we have just presented, it is clear enough that neither in the second step (selecting test cases) nor in the fourth one (evaluating progress) does this class of frameworks provide any assistance at all.

3.2 Behaviour-Driven Unit Testing Frameworks

Behaviour-driven development (BDD) [54] is a recent spin-off from test-driven development (TDD) [13], so we could think of behaviour-driven unit testing frameworks as derivatives of classical ones. In short, from a tool standpoint, BDD frameworks try to encourage programmers to write tests focusing on the behaviour of objects by providing two main features: first, a testing framework with a syntax close to a domain-specific language, so that test cases look more like specifications; second, better assertion facilities for object-oriented software.

Typical BDD frameworks include JBehave (the pioneer) or the XSpec family, with SSpec and RSpec [2] as its most prominent members. Incidentally, the latter is the one used for testing the tool we develop throughout the thesis.

Despite renaming unit tests to specifications, hence the XSpec acronym, BDD frameworks do not offer anything new from a quality assurance point of view. We could rewrite the previous example and it would result in nearly identical code, with perhaps a slightly nicer syntax. The programmer still has to perform manually the same steps as with the classical frameworks.

From a software process point of view, BDD frameworks stress the TDD ideas of testing first and using tests as a guiding element for the design. Hence, they provide stubs, which are elements that replace object collaborators, giving canned responses to method calls. This allows tests to be elaborated very early, even when the object collaborators are not yet developed. They also stress the usage of mock objects, a construct for checking the messages exchanged between objects. So we could argue that BDD frameworks enhance the assertion facilities of classical ones for object-oriented software, but are essentially the same.

3.3 Specification-Based Unit Testing Frameworks

Specification-based frameworks rest on completely different concepts than classical and BDD ones. Although the ideas behind them (automated generation of test cases from specifications) are relatively old [52], their application to daily general-purpose unit testing is relatively recent. QuickCheck [24], a tool for Haskell, is the pioneer in this area, and has led to a number of XCheck frameworks for other functional languages.

These frameworks depart from classical ones at the very first step, as they define a property, or Boolean function, that acts as a specification. For instance, the following property states that reversing a list of integers twice should result in the original list [24]:

    prop_RevRev xs = reverse (reverse xs) == xs
      where types = xs :: [Int]

From there onwards everything is different. Having an explicit specification means that it is possible to automate all the remaining steps of the testing process. Test cases can be created and evaluated using the property. QuickCheck uses random generators for the types defined in the specification. Progress evaluation is also feasible by using the property or the code, for instance generating cases until code coverage is achieved. To sum up, specification-based frameworks appear to be a step forward from the other two described.
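To make the automation concrete, the following minimal, framework-agnostic sketch (ours, written in Ruby rather than Haskell, and not QuickCheck's actual implementation) shows the loop such tools automate: generate random inputs, evaluate the property, and report the first counterexample found.

    # Generate `runs` random inputs and check the property against each one.
    def check(property, generator, runs = 100)
      runs.times do
        input = generator.call
        return "Falsifiable: #{input.inspect}" unless property.call(input)
      end
      "OK, passed #{runs} tests"
    end

    # The reverse-twice property, here over arrays of integers.
    prop_rev_rev = ->(xs) { xs.reverse.reverse == xs }
    int_list_gen = -> { Array.new(rand(0..5)) { rand(-100..100) } }

    puts check(prop_rev_rev, int_list_gen)
    # => "OK, passed 100 tests" (the property is sound)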

Chapter 4

A Specification-Based Unit Testing Framework

There is no royal road to geometry.
Euclid, in reply to Ptolemy

Up to this point we have presented the existing unit testing frameworks, outlining their virtues and deficiencies. In this chapter we show the reader what we think should be the essentials of a framework that achieves the goals enumerated at the beginning of the thesis.

Our ideas can be well summarized, without excessively oversimplifying them, by stating that we believe there is no unique test strategy that suffices for addressing the myriad of different classes of bugs that can possibly be introduced in the software built nowadays. Instead, we think that various complementary approaches for generating tests and measuring progress should be combined, just like different programming paradigms are used for solving different kinds of problems [5, 60].

In the first section we cover the foundational ideas of the framework, to then turn to the formal definition of properties, which are its most basic construct. Both are the stepping stone for the rest of the thesis. Finally we give a brief idea of a possible usage of the tool inside the development process.

4.1 Design Principles

Fundamental ideas

There are a few key ideas that should be taken into account when building any testing framework.

Exhaustive testing is impossible. From a theoretical point of view, equivalence checking of two functions is uncomputable [28]. In practice, trying all possible branches of execution becomes intractable even for trivial programs [21]. As a consequence, frameworks which do not use a formal approach should adopt a statistical one, employing the finite resources available for testing where the probability of finding bugs is greater, as stated in the next point. Sadly, there is only a small body of broad and well-proven evidence regarding the distribution of errors in software, as indicated below. Hence, it is the user who ultimately has a significant share of the responsibility for the effectiveness of the framework, by determining the distribution of the errors in the units under test and setting up the tool accordingly.

Software errors follow a Pareto distribution. 80% of the bugs come from 20% of the code. Therefore, although exhaustive testing is impossible, it is certainly feasible to achieve a big quality enhancement by concentrating the efforts on the most problematic modules. This idea has long been used successfully in the quality control processes of other industries [38]. Empirical studies have confirmed the applicability of this principle to software [30, 7], and at the same time they have disproved other myths, such as that the number of lines of code (LOC) is a good estimator of the errors of a certain module.

Apart from the Pareto principle, there are few other confirmed hypotheses concerning software faults, and even fewer that are applicable to a unit testing framework. Fault densities have been shown to be equivalent between similar projects, but that has little application to our problem domain. Interestingly, it has been demonstrated that the bug rate of each module remains constant during different phases of development. Moreover, another long-standing myth, which says that complexity metrics are good predictors of faults, has been disproved [7]. That is, there is no statistical correlation between metrics such as cyclomatic complexity and the number of bugs of a certain module.

There are two other principles that are generally accepted in other fields of Computer Science but that seem to be ignored by the literature on testing. We strongly believe these ideas should also be taken into account.

There is no golden road to unit testing, just as there is no golden road to programming. It is acknowledged that there is no best programming paradigm for solving all problems [5, 60], although some people insist on fitting them to the paradigm du jour, when sometimes there are noticeably better approaches. We think this idea is also valid for testing. For instance, random testing is adequate for finding bugs involving very complex conditions, but may frequently fail to identify simple ones, which is in many cases the easiest way to disprove the correctness of a unit of code. Besides, running test cases that failed in the past may also be a good way to find faults.

Humans have different abilities than machines. Therefore, any approach to testing that tries to mimic closely how humans look for bugs is doomed from the beginning. This idea is largely accepted in artificial intelligence [62]. Dijkstra put it succinctly: "The question of whether machines can think... is about as relevant as the question of whether submarines can swim." [29] Computers are able to beat world-class chess players, yet they use radically different techniques. They consider millions of moves per turn, whilst expert humans only think about a few. Humans exploit their ability to reason, whereas computers take advantage of their power to deal with huge sets of data. The same applies to testing. Humans can only cope with a few test cases, but they have the ability to solve complex constraint problems. That is why they employ techniques such as analyzing extreme values.

Problem constraints

We have found two important restrictions that limit the form the framework can adopt if we want it to be of practical use.

A good balance between simplicity and complexity should be achieved. It has to be simple enough to be unobtrusive, and complex enough to provide better solutions than current ones. Simplicity is a must, as unit testing frameworks are used heavily in in-the-trenches programming duties.

Properties should be expressed in the host language. This is a key aspect derived from the previous simplicity constraint, important enough to be stated separately. We will later show with an example that it is only possible to do so in practice when the language supports first-class functions.

Unit testing is used for daily programming tasks, where dealing with two different languages at the same time results in too much overhead for the developers. The XUnit family of frameworks enjoys wide popularity for this reason. In contrast, more advanced tools such as the Java Modelling Language (JML) [44] have not left academia, despite having much more to offer, because they employ a domain-specific language. Of course, in the latter case the decision was difficult, as JML attempts to implement, among other things, Design by Contract facilities, and Java is not expressive enough to be used for writing its own contracts concisely.

Design elements

Finally, taking into account the points stated previously, we have chosen to base our framework on the following items.

Random generation of test cases. This technique is a natural and flexible way to create cases to be run against properties, and as such is the only strategy employed by the majority of specification-based frameworks [24]. As explained in chapter 3, default types have predefined generators that yield random values. The user can combine those generators if necessary to build new ones, either for generating values of different types, or for changing the random distribution so that the probability of finding bugs is greater.

Exhaustive depth-controlled enumeration of test cases. As shown in [61], random generation of test cases is a method for finding all classes of bugs, yet it fails to detect simple ones easily in some scenarios. According to the small scope hypothesis, if a program fails to meet its specification, it almost always fails in some simple case [37]. Therefore, trying all simple cases up to a certain depth, by employing a generator that provides them in such an ordering, tends to be the easiest and fastest way to disprove the correctness of a piece of software. A small sketch of this strategy is given at the end of this list.

Testing old failed cases. It is desirable to reveal bugs as soon as possible. For this reason, users of classical unit testing frameworks have adopted the practice of writing a case per bug they find in the software being developed. Therefore, if at any point the bug is reintroduced, it will be discovered promptly. Current specification-based frameworks lack a mechanism for automating this, which is even more important in their case. Due to randomness, and the usually immense size of the set of inputs for each property, a reintroduced bug may not be discovered until after a few executions of the tool. By recording failures and using simple scheduling techniques borrowed from regression testing [42], it is possible to show as soon as possible whether a fault has been reintroduced in the software, without incurring too much overhead.

Human-generated test cases. Enabling users to provide their own test cases is a good practical compromise. As illustrated previously, humans have different abilities than machines. In particular, they can easily solve constraint satisfaction problems, something which, in its most general form, is still an open problem for all the types that appear in a program, which is what we are concerned with. Or at least, they are able to find some solutions easily. Hence, the user is in a very good position to provide a few test cases that may find faults. What is more, he may want to perform white-box testing, a technique which has been left out of the framework, and these cases provide a good opportunity to do so.

Property coverage. Different test case generation techniques have mechanisms for measuring progress, but these are tightly coupled to them. For example, a simple way of measuring the effort employed so far in exhaustive testing would be to count the depth up to which all cases have been generated. However, this hardly gives an objective, high-level, strategy-independent method of accounting for the exhaustiveness of the test cases already run. A general approach could be to compute how far we are from running test cases such that all Boolean subexpressions of the property are evaluated to all possible values that make the whole expression true. Intuitively, the reader can see that this ensures test cases are evenly distributed, at least to a minimal extent. Later on, we will provide a formal proof of this fact. A simplified sketch of this measure is also given at the end of this list.

Genetic testing. By using dynamic evolutive techniques, and in particular genetic programming [6], it is possible to actively pursue property coverage. Traditionally these methods have been employed for refining test cases in order to achieve white-box goals, such as branch coverage, as explained in chapter 2. But since properties are executable code, there is no reason that stops us from applying this technique to specification-based unit testing.

Properties and design by contract. Design by contract [49, 50] specifications share many similarities with properties, yet no system merges the two into a unified approach while at the same time providing the facilities typical of specification-based frameworks, such as random testing. This is mainly due to the fact that the majority of specification-based systems focus on functional languages, where design by contract has little acceptance, but also due to a remarkable technical difficulty. Contracts are basically properties meant to be executed at runtime, to verify specifications dynamically and act accordingly when they are not satisfied. Hence, polymorphism is used heavily, as the methods decorated with contracts can be polymorphic. Since specification-based systems do not support polymorphic properties, it is not possible to merge both concepts into a unified one. We hope to overcome this difficulty by employing a dynamic language, where the type system will not impose such a barrier on us.
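To make the depth-controlled enumeration element more concrete, the following minimal sketch (ours, not the framework's actual implementation) enumerates all integer arrays whose length and element magnitude are bounded by a depth parameter, smallest cases first, and uses them to falsify a deliberately buggy sorting routine. In line with the small scope hypothesis, the first counterexample found is a very small one.

    # Enumerate all arrays of integers whose length and element magnitude are
    # bounded by `depth`, smallest cases first.
    def arrays_up_to(depth)
      values = (-depth..depth).to_a
      (0..depth).flat_map do |len|
        len.zero? ? [[]] : values.repeated_permutation(len).to_a
      end
    end

    # A deliberately buggy unit: claims to sort, but drops duplicates.
    buggy_sort = ->(xs) { xs.uniq.sort }

    # Property: sorting must preserve the number of elements.
    prop_preserves_size = ->(xs) { buggy_sort.call(xs).size == xs.size }

    counterexample = arrays_up_to(2).find { |xs| !prop_preserves_size.call(xs) }
    puts "Smallest failing case: #{counterexample.inspect}"
    # => [-2, -2]: a very small case already falsifies the property

Similarly, the following sketch (again ours, and deliberately simplified) records which truth values each instrumented Boolean subexpression of a property has taken across the test cases run so far. The framework's actual property coverage measure is more refined, as it only counts valuations that keep the whole expression true.

    require 'set'

    class PropertyCoverage
      def initialize
        @seen = Hash.new { |h, k| h[k] = Set.new }
      end

      # Wrap a subexpression: remember its value, then pass it through unchanged.
      def observe(label, value)
        @seen[label] << value
        value
      end

      # Fraction of (subexpression, truth value) pairs exercised so far.
      def ratio
        return 0.0 if @seen.empty?
        @seen.values.map(&:size).sum / (2.0 * @seen.size)
      end
    end

    cov = PropertyCoverage.new

    # A toy property with two instrumented subexpressions that must agree.
    short = ->(s) do
      cov.observe(:lhs, s.size < 10) == cov.observe(:rhs, s.length < 10)
    end

    ["", "abc", "a longer string than ten chars"].each { |s| short.call(s) }
    puts "Property coverage: #{(cov.ratio * 100).round}%"
    # => 100%: both subexpressions have been seen both true and false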

4.2 Properties

Basic definitions

So far we have treated properties a bit informally, although it is true that they have been defined implicitly a number of times, and we have also explained how other frameworks deal with them. Let us present a formal definition.

Definition 4.1 (Property) A property is a function without side effects f : T → {true, false}, where T is any type. This includes base types, composite types, as well as the special type that denotes functions of arity 0.¹

Remark 4.2 Composite types are those built from base types, such as tuples, lists or classes.

Example 4.1 Property p1 specifies that the length of strings must be distributive over concatenation.

    property :p1 => [String, String] do |a, b|
      (a + b).length == a.length + b.length
    end

Clearly the image or return type of p1 is a Boolean, since the operator applied last is that of equality, ==. The input type T is a pair of strings (String × String), or [String, String] according to our syntax. That is, a tuple. Tuples are composite types built from a list of types, two strings in this case. Every property of arity greater than 1 needs to be defined as ranging over a tuple, as this is the only way to provide several parameters.

We use a syntax whose concrete details will be explained in chapter 8. However, any reader familiar with a programming language that supports lambda abstractions is already capable of understanding it with only one clarification: {} and do end are the two ways we employ for representing lambda functions that are one line long and more than one line long, respectively.² We require the input type T of the lambda function to be labelled explicitly, for reasons explained a bit later. Additionally, we also require an identifier to be attached to the property to ease the task of referring to it later, for example inside other properties.

¹ This type is usually labelled unit, void or () in common programming languages.
² For example, λx y. x + y could be represented as { |x, y| x + y } or do |x, y| x + y end, usually employing the latter for long expressions, and introducing line breaks as shown in the example above.

Definition 4.3 (Test case) A test case for a property f : T → {true, false} is a value whose type is T, the input type of the property.

Example 4.2 A possible test case for the property p1 is the pair of strings ['a', 'b'], or ("a", "b") in mathematical notation. Property p1 evaluates to true when provided with this test case. Actually, there is no test case that makes it false, since it is sound.

Remark 4.4 The test case generation process consists of creating test cases for each property with the goal of falsifying it.

Property language

Our syntax for defining properties is tied to the underlying implementation, as a major goal is to allow programmers to write specifications in the host language. Hence, our definition of the property language, or the set of all functions we allow to be used as properties, depends on it.

Definition 4.5 (Property language) The property language is the set of all functions in the underlying implementation language (Ruby) that evaluate to true or false, and whose body is a single expression.

Properties have been defined as purely functional constructs for obvious reasons. Being specifications, their execution must not have any side effects at all. Hence, it is reasonable to restrict them to a single expression, as in practice this reduces the number of potential side effects and makes them easier to analyze.

In an imperative language this may significantly reduce the expressiveness of properties. The host language may not provide first-class functions, or lambda abstractions. Therefore, it may not be possible to capture simple specifications which contain existential or universal quantification without using imperative constructs.

Example 4.3 Property p3 specifies that all Array instances created with the two-parameter constructor must only contain the elements indicated by the second parameter. This constructor creates an Array of the size indicated by the first parameter, filled with the element provided by the second one.

    ∀(s, o) ∈ (ℕ × Object). ∀e ∈ Array.new(s, o). e = o

    property :p3 => [Natural, Object] do |s, o|
      Array.new(s, o).all? { |e| e == o }
    end

The universal quantification over the elements of the array needs to be captured using a loop in the absence of first-class functions. The property would have to be rewritten iterating explicitly over the elements of the array, verifying whether each element is equal to the second parameter of the property or not, and returning true or false accordingly.

Before going any further, the reader will have noticed the presence of an additional universal quantifier when rewriting the property in mathematical notation. This aspect is very important and deserves a clear observation.

Remark 4.6 The input type T of a property is universally quantified in an implicit manner.

The above remark explains why properties may need to be labelled explicitly, indicating the type of T. Since properties are universally quantified over the input type, the user may want to specify a property that has a different input type than the inferred one. There are several reasons to do this. The scenario under specification may range over subtypes of the inferred type, either because the user has chosen to do so, or because the property is polymorphic, which is forbidden by most specification-based systems. What is more, in our case we deal with a dynamic type system. Hence, there is no type inference and type annotations always need to be provided. Otherwise, the framework has no type information for generating test cases.

Specifications and properties

Definition 4.7 (Specification) A specification is the definition of the requested behaviour of a module of software.

Properties are a simple way of writing specifications by means of Boolean functions. Thus, we can argue that properties are a type of specification. Both terms are used interchangeably in some parts of the literature, e.g. [61], and in the present thesis.

Remark 4.8 The terms property and specification are used as synonyms throughout the thesis, although the latter is broader than the former.

Degenerate cases

Example 4.4 Property p2 specifies that empty arrays must have size 0. [] is the literal which represents an empty array.

    property :p2 { [].size == 0 }

The property above is an example of one with arity 0. As we can see, this class of properties represents degenerate cases where the predicate takes no parameters. This is usual when testing procedural code, or when specifying scenarios that cannot be easily generalized. The previous example belongs to the latter category (scenarios difficult to generalize), as there is no obvious way to define the size of an array without auxiliary functions that do not depend on this size too.

Remark 4.9 Properties of arity 0 have the peculiarity that no test cases can be generated for them. Hence nothing can be done to falsify them other than just evaluating them once.

4.3 Testing Practices

An interesting question is how to use a specification-based unit testing framework in the software development process. The only real case we found in the literature is the usage of QuickCheck for testing the Edison library of functional data structures [24, 57, 56]. Unfortunately, this survey is not of much utility from a process perspective, as it merely consisted of replacing existing unit tests with properties in an already built piece of software. Nonetheless, it is worth mentioning that the user reported great success, reducing the code needed for testing by 75%.

Initial guidelines on using classical frameworks like JUnit suggested transforming all printouts and debugger expressions into unit tests [14]. With the advent of Extreme Programming [12], tests became a central part of an agile process with very small iterations. Programmers were encouraged to write their tests before the code. This practice has become even more radical in test-driven development [13]. New features begin with writing tests, which are run expecting them to fail. Then, as little code as possible is written to make the failed test pass. Tests have become a construct that guides the design. In specification-based testing we also advocate writing properties first. This forces the programmers to concentrate on getting a formal specification initially, rather than focussing on coding.

Unlike in classical unit testing, the specification is explicit, and thus has better documenting value and does not tend to become inconsistent or incomplete. However, being very declarative and small, it is difficult to use properties for guiding the design. They simply cannot capture a scenario big enough for it to make sense to apply the principle of running the tests without the code to watch them fail. Furthermore, the framework records old failing test cases to check them in future executions. Therefore, running properties without code makes it store arbitrary test cases that failed for no reason other than that there was no code. Nevertheless, this feature can be turned off in such scenarios if desired.

Suggested process

We sketch how to use the tool in a similar way to test-first approaches, but taking into account the particular aspects of specification-based tools just mentioned.

1. For each minimal unit of code (function or class) elaborate a complete specification using properties. The specification may involve more than one property. This is especially true for classes, where there will almost always exist more than one contract, as the class will presumably have more than one method. In the case of functions it is also usually the case that there are different properties for each one, stating different relationships with other functions. Moreover, each function can also have a contract.

2. Write as little code as possible so that the properties just written cannot be falsified. If these properties are too complex to be implemented in one iteration, it may be the case that you need to consider breaking the function or class into smaller parts. If that is not the case, you may also consider implementing code to satisfy only a subset of the properties just written. It could be useful to rewrite some properties composed from simpler ones, if you plan to satisfy only some of the cases stated by a single property.

3. Feed the properties you have just written code for into the framework, along with the existing ones from previous iterations, and run it.

4. Fix any bugs found in your code and go to the previous step. If there were no faults, go to the second step as long as you still need to write code for some existing properties you have not checked. Otherwise, go to the first step.
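As a small illustration of steps 1 and 2 (ours; the clamp function and the property names are hypothetical, and the property syntax is the one introduced earlier in this chapter), one iteration of the process could look as follows.

    # Step 1: specify the unit before writing it.
    property :clamp_within_bounds => [Integer, Integer, Integer] do |x, lo, hi|
      lo > hi || (lo..hi).include?(clamp(x, lo, hi))
    end

    property :clamp_idempotent => [Integer, Integer, Integer] do |x, lo, hi|
      lo > hi || clamp(clamp(x, lo, hi), lo, hi) == clamp(x, lo, hi)
    end

    # Step 2: write as little code as possible so that the properties
    # above cannot be falsified.
    def clamp(x, lo, hi)
      [[x, lo].max, hi].min
    end

Steps 3 and 4 then amount to feeding both properties to the framework together with those from previous iterations, running it, and fixing whatever counterexamples it reports.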

Chapter 5

Unit Testing Strategies

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
Charles Babbage

In the first place, this chapter describes the different complementary strategies employed by the framework for generating test cases to disprove properties. Each strategy is aimed at a particular class of software faults, although some are more general than others. Moreover, strategies are backed up by empirical laws about the distribution of bugs, and provide mechanisms, tightly coupled to them, for measuring the effort employed in testing. All these aspects are covered in detail.

It should be noted that some test generation strategies ultimately rely upon the user's guidance in order to produce good test cases. Needless to say, if bad guidance is provided, bad test cases will be generated.

In the second place, a strategy-independent way of measuring testing progress is defined. Such an approach is especially necessary in any framework which combines more than one test generation mechanism, in order to have an objective benchmark of testing effort.

5.1 Test Case Generation

Test cases, as indicated by definition 4.3, are values whose type is the same as that of the property they have been created for. The subsequent remark 4.4 stated that test case generation has the goal of providing cases such that properties are falsified. This is just an application of the more general principle, by now mentioned a couple of times, which states that the goal of testing is to show the presence and not the absence of bugs. More precisely, test case generation should provide test cases such that the probability of falsifying properties (finding faults) is as high as possible.

As indicated in chapter 4, there are few general and well-proven facts about the distribution of bugs in software. Hence, it is very difficult to create a mechanism for generating cases that fits all scenarios. Instead, our philosophy is to provide a few strategies aimed at the different classes of bugs. A number of these rely upon the user's guidance and fine-tuning in order to generate good cases for particular scenarios.

We briefly covered some of these approaches previously. For instance, random testing was sketched when describing the state of the art in chapter 3, as this strategy is in most cases the only one employed by specification-based frameworks. Here the discussion is geared towards how to integrate them into a unified framework with different strategies: that is, an abstraction process such that all strategies can be used in the same way. Low-level design details, such as the selection of a particular mechanism for yielding cases, will not be covered until chapter 8. The same applies to the rest of the strategies which we present subsequently. This does not mean that our exposition is vague. Quite the contrary: after reading this chapter the user will be in a good position to start using the system. We will just try to hide implementation-dependent decisions whenever possible, and stress the generic design decisions valid for building any framework that relies upon the same principles as ours.

For each strategy we provide a short section outlining the bugs it is directed at, along with the principles behind such an approach. Next, we describe how the strategy works, to finally show how it measures its own progress.

Random testing

Motivation

Random testing is the most natural way to generate test cases for a specification in an automatic way. The framework just takes the types indicated in the signature of the property and uses the generators associated with them, which yield random values according to some distribution. Generators for common standard types such as Integer, Boolean or String are provided by the tool.

This strategy is only effective for disproving properties when the distribution of test data follows that of actual data [34], as we will see shortly. Hence, it is rarely the case that predefined generators are enough. So, the user has to combine them to match the specific needs of the properties under test. What is more, standard types are only a small fraction of those used in any program. Again, the user has to specify their random distribution, possibly taking advantage of the generators and combinators provided by the system.

It is easy to imagine that random testing could be aimed without difficulty at any class of bugs, provided that an adequate distribution of test cases is defined by the user, as the mechanism does not have any inherent limitations.

Test case generation

In order to illustrate how test case generation works, let us begin with a simple example and progressively head towards more complex scenarios. The following specification of the reverse function has become the Hello World equivalent of specification-based frameworks. With some variants, such as making it range over lists instead of strings, it is very frequently used as an introductory case in the few tutorials and papers that exist on the subject.

Example 5.1 Property reverse specifies that reversing a string is equivalent to splitting the string into two, reversing the parts and putting them together in the inverse order they were split.

  property :reverse => [String, String] do |a, b|
    (a + b).reverse == b.reverse + a.reverse
  end

We feed this example to the framework and employ random generation of test cases in order to find two instances of the class String such that the property is disproved, i.e. evaluates to false. The framework uses the predefined generator of the standard class String, as we have indicated it to do so in the signature of the property. The generator yields strings of size up to 5. By default, all string sizes happen with equal probability, and each string position can be filled by any ASCII character with equal probability as well.

After running a fixed number of cases (we will elaborate more on the number of cases that are run before giving up in later sections and chapters) the tool reports it has not found any that falsify the property. This is correct, since the property is sound. If we mistakenly define the second part of the predicate as a.reverse + b.reverse, the tool finds a counterexample within the very first few cases, as it suffices that both strings of the pair (a, b) are non-empty and different to disprove the property. For instance, ("ab", "1") is the example shown in one execution. Note that "1ba" is not equal to "ba1".

As we pointed out previously, the adequacy of the default generators is the exception, not the norm. Consequently, it is important to get acquainted with the different facilities for defining custom ones as soon as possible. Nonetheless, it is quite difficult to define a property that illustrates such a scenario without having a codebase to formulate predicates against. The following one is one of the simplest we could think of.

Example 5.2 Property printf specifies that the printf function always returns nil, no matter what string we call it with.

  property :printf => String do |s|
    printf(s) == nil
  end

The printf function in the underlying implementation language has the same semantics as in the well-known C Standard Library. That is, it takes a string, which may include format specifiers such as %d, and an optional number of parameters used in those format specifiers. The string is given format, printed to the standard output, and a value (nil in our case) is returned signaling there was no error.

Our property is not sound. For instance, printf will not terminate correctly when called with "%d", as it will be missing an additional argument to be interpreted and printed as a decimal number. However, after feeding the property to the tool, no counterexamples are found and it terminates gracefully.

To see why, remember that the String default generator provides sequences of up to 5 characters, and all ASCII characters have equal probability of belonging to the generated String. Therefore, the probability of getting a format specifier is so low that even after thousands of executions the property is still not falsified. For instance, the probability of getting "%d" in a String of size 2 is as low as (1/128)^2. Of course there are a few more format specifiers, and other control strings that can break printf, but even when taking them into account the probability is still too low to find a counterexample within one execution.

Clearly we need to define a custom generator, and the usual way to do so is to declare a new type and make the property operate over that type. However, it seems like overkill to define a new type just to change a minor aspect of the generator. That is why standard types provide a way to change common parameters of their generators through their of method. We find this is a very convenient feature not found in other frameworks. For instance, the String type allows the user to specify a frequency for the generation of its instances, dividing ASCII into 4 classes: alphabet, control, number and special chars.

Remark 5.1 Standard types provide a facility for modifying the common parameters of their generators through the method of.

Remark 5.2 In the case of String, the method of allows the user to specify the frequency of each class of characters (alphabet, control, number and special) at each position.

Example 5.3 printf property refactored using a customized generator defined by calling of.

  property :printf => String.of([1, 2, 1, 5]) do |s|
    printf(s) == nil
  end

With the above property definition, counterexamples are found fast. Alphabet characters only appear with a probability of 1/9 at each position. Control, number and special characters appear with probabilities of 2/9, 1/9 and 5/9, respectively. As the % char belongs to the last category, the probability of getting abnormal cases is much higher now. After a few cases the system reports that the String "a%s0" makes printf throw an ArgumentError. This is due to the fact that it expects an additional parameter after reading the format sequence "%s", but none was provided.

Next, we show how to do it properly by defining a new type. As said, in this case it is too cumbersome. However, in other cases it is necessary, as we may need to do something radically different from what the standard generator provides, and customization through parameters is not enough.

Example 5.4 printf property refactored using a custom type.

  property :printf => PrintfString do |s|
    printf(s) == nil
  end

  class PrintfString < String
    def self.arbitrary
      alphabet = one_of(65..90, 97..122)
      control  = one_of(0..32, 127)
      # ... number and special are defined analogously
      frequency({ alphabet => 1, control => 2, number => 1, special => 5 })
    end
  end

The new type inherits from String, although that is not necessary in this case. The generator method has to be named arbitrary, following the convention of QuickCheck. random might have been more appropriate, but there can be a name clash with random number generators in some types.

Remark 5.3 Generators are functions that return, or yield (after the keyword of the same name in CLU [46]), values for performing an outer iteration. That is, generators serve for iterating over a collection of elements while explicitly manipulating those elements. In contrast, (internal) iterators receive a block or closure and apply it to each element, but do not uncover these elements to the caller.

As a small digression, in chapter 8 we will show a scenario where we need to transform internal iterators into external ones. It is possible to do so by means of continuations, a construct popularized by Scheme [66].

We use the one_of combinator to create a generator for each of the 4 classes we divide the ASCII alphabet into. This is not strictly necessary, as String already provides us with those generators through the inheritance tree. However, we do so to illustrate the usage of this combinator. Finally, we use the frequency combinator to establish the random distribution of the aforementioned classes. Both combinators have also been modelled after QuickCheck [24].
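To make the digression about external iteration concrete, the following minimal Ruby sketch turns an internal iterator into an external one. It uses Ruby's Fiber rather than first-class continuations, and the externalize helper is purely illustrative; the actual mechanism used in chapter 8 may differ.

  # Wrap an internal iterator (each) so that elements can be pulled one at a
  # time by the caller, i.e. turn it into an external iterator.
  def externalize(collection)
    Fiber.new do
      collection.each { |element| Fiber.yield(element) }
      nil  # signal exhaustion
    end
  end

  it = externalize([1, 2, 3])
  it.resume  # => 1
  it.resume  # => 2
  it.resume  # => 3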

Remark 5.4 Combinators are higher-order functions that define their return value solely by using function application and other combinators [26].

Remark 5.5 The combinator one_of takes a collection of values, combinators and types (classes). It produces a new generator that yields each one with equal probability.

Remark 5.6 The combinator frequency takes a map whose keys are values, combinators or types (classes), and whose values are frequencies. It produces a new generator that yields each key with the indicated rate over the total.

A small but interesting detail is that calling one_of is equivalent to calling frequency with all rates equal. The advantage of one_of is that it provides convenient syntactic sugar for passing it collections such as arrays or intervals. For instance, one_of(0..10) is equal to:

  frequency({ 0 => 1, 1 => 1, 2 => 1, ..., 10 => 1 })

This explains how we elaborated the PrintfString arbitrary generator by building small generators for each interval of ASCII chars using the one_of combinator in example 5.4.

Generation process

We are now in a good position to outline the test case generation process for random testing, implicitly defined through all the examples. For each property the framework chooses the indicated generators. These are usually the ones linked to the types (classes) indicated in the property signature. We have seen already that the generator attached to a type can be modified by using the of method for a particular property. Moreover, we can use generators different from those linked to the types by using a mechanism we will describe in chapter 8. The important idea now is that the indicated generators must respond to the message (method call) arbitrary by yielding one random value. Then, for each test case:

1. Each of the generators is used once, by calling its arbitrary method, which returns one random value.

2. All the values yielded by the individual generators are wrapped into a tuple and returned. The tuple is the test case.
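As an illustration of how such combinators might be realized, here is a minimal Ruby sketch. It handles only plain values and ranges (not types with their own arbitrary method) and represents generators as lambdas; the framework's actual implementation is described in chapter 8 and may differ.

  # one_of: pick any of the given values (ranges are expanded) with equal probability.
  def one_of(*sources)
    values = sources.flat_map { |s| s.is_a?(Range) ? s.to_a : [s] }
    -> { values.sample }
  end

  # frequency: keys are values or generators, values are their relative rates.
  def frequency(table)
    pool = table.flat_map { |entry, rate| [entry] * rate }
    -> do
      chosen = pool.sample
      chosen.respond_to?(:call) ? chosen.call : chosen
    end
  end

  digits  = one_of(0..9)
  letters = one_of('a'..'z')
  biased  = frequency(letters => 1, digits => 3)
  biased.call  # yields a digit roughly three times as often as a letter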

Remark 5.7 If the framework detects there are no generators for a particular property and strategy (random generators in this case), it gracefully skips that strategy for the current property.

Progress measurement

Finally, we present how to measure progress in a strategy-dependent way. This has required a small abstraction effort: existing specification-based frameworks only have one strategy, and thus do not need to compare progress among different ones. Such a comparison is useful when using all of them simultaneously, to see how far each is from achieving its own goals for the current property and act accordingly. That is, to let the strategies that are behind generate more cases.

Each strategy has an effort measure, indicating how much of its test case space it has covered so far for the current property. Also, it has an arbitrary goal, indicating the minimum expected effort. Thus, we define for each strategy s that is generating cases for a property p the normalized testing progress.

Definition 5.8 Normalized testing progress
Normalized testing progress(s, p) = Testing effort(s, p) / Testing goal(s, p)

Remark 5.9 A normalized testing progress of 1 indicates that the strategy s has achieved the arbitrary goal for the current property p.

Definition 5.10 Random testing effort
The random testing effort is the number of test cases generated for the current property.

Definition 5.11 Random testing goal
The random testing goal is an arbitrary constant, representing a number of cases to be generated.

Usually, in random generation one would like to exclude from the effort measurement those test cases that make the property evaluate trivially to true. One wants to avoid getting a false sense of progress when, for instance, the property is an implication whose antecedent is difficult to satisfy. The antecedent may always be false, so the whole property would evaluate trivially to true, yet the progress indicator would still be incremented. Due to the way the rest of the strategies measure progress, this is not necessary for them. Moreover, strategy-independent progress measurement also protects us from these scenarios. As we will see, we are able to identify that the antecedent of an implication has never evaluated to true, for instance, so this issue is not that important.

Definition 5.12 Trivial case
We define a case as trivial if the property has evaluated to true, the property contains at least one implication whose antecedent has evaluated to false, and switching the outcome of that implication to false makes the property false.

Example 5.5 A property whose predicate evaluates to the following is trivial.

  true ∧ ((false ⇒ true) ∧ true)

Exhaustive testing

Motivation

In some cases, random testing may fail to disprove properties even though there are simple counterexamples. This is due to the fact that it may be difficult to define a random distribution for the generators that covers those values, or simply because the input space is too large.

The approach taken by exhaustive testing is to order input values according to a size criterion, and generate all the values up to a certain size [61]. That is, generate all values of size 0, 1, and so on, up to the depth that timing constraints allow us to reach. This may seem a naïve strategy, but it is not. Model checking tools have successfully used the same method for a long time. They are capable of formally proving that relatively simple instances of a problem satisfy a given specification, but fail to terminate in complex scenarios.

This technique is supported by the small scope hypothesis [37], which claims that most faults are located in simple cases. In other words, if a program fails to meet its specification, it almost always fails in some simple case. An alternative formulation says that if a program does not fail in any simple case, it hardly ever fails in any case.

Exhaustive testing is, like random testing, a general-purpose approach in the sense that it can be applied to any type of software. Enumerating all small values exhaustively is a bit more complex, but this is no inherent restriction. An important point is that, unlike random testing, it is aimed only at simple faults. We cannot expect to enumerate all values up to complex cases even in simple scenarios.

Test case generation

Let us recall the printf property originally presented in example 5.2 to illustrate how exhaustive test generation works, and its advantages. As the reader will remember, we needed to customize the generator of String, since the default one did not suffice for disproving the property: the probability of getting a format sequence among strings of size up to 5 was very low. This is not the case with exhaustive test generation.

The signature of the property defines it as one that takes a String as input. Hence the framework calls the exhaustive method of this class (note the contrast with the arbitrary method used for random testing) in order to get an iterator of all the elements of size n, which is provided as a parameter. The size of each type is defined on a per-type basis, but for String the obvious way to do it is by using its length.

Initially, the tool requests an iterator for the strings of size 0. It only returns the empty String: ''. The property is not disproved, as printf returns gracefully. Next, the tool gets an iterator for the strings of size 1, which yields 128 strings: "\0", "\1", ..., "\127". None of these falsifies the property, as there is no possible format sequence of only one character. Subsequently, the framework carries on iterating through the strings of size 2, of which there are 128^2 = 16384. This is a very reasonable number, but it is not even necessary to go through all of them, as "%\1" breaks printf, falsifying the property.

Note how the small scope hypothesis applies in this scenario. A simple case was sufficient to disprove the property, whereas in random testing we got a rather complex one: "a%s0". Apart from not having to define a custom generator, we have the advantage of getting the simplest case that disproves the property according to the ordering we define for the type. This is very important, as it helps to track down the source of the bug much more easily.

Remark 5.13 Exhaustive testing always disproves the property with the simplest case available, that is, the lowest-size case presented first by the iterator that evaluates to false.

As happened with random testing, in many situations the default generators (iterators in this case) are not suitable. Hence the user can build new ones, possibly by combining the existing iterator factories. The union and product combinators come in handy.
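Before turning to those combinators, the following minimal Ruby sketch shows the kind of size-indexed enumeration just described. It materializes each size as an array, whereas the framework returns lazy iterators, and the exhaustive_strings helper is illustrative only.

  ASCII = (0..127).map(&:chr)

  # All ASCII strings of exactly the given length (feasible only for small sizes).
  def exhaustive_strings(size)
    return [''] if size.zero?
    exhaustive_strings(size - 1).flat_map { |prefix| ASCII.map { |c| prefix + c } }
  end

  exhaustive_strings(1).size                          # => 128
  exhaustive_strings(2).size                          # => 16384
  exhaustive_strings(2).find { |s| s.include?('%') }  # a "%..." candidate appears early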

Remark 5.14 The union combinator takes any number of iterator factories and returns a new iterator factory that builds iterators which, for each size, yield the elements provided by each of the given factories in turn.

Remark 5.15 The product combinator takes any number of iterator factories and returns a new iterator factory that builds iterators which yield tuples of elements, one from each of the given factories.

The iterator factory term is admittedly a bit pretentious. By iterator factory we mean any object that responds to the message exhaustive and returns an iterator of elements of the indicated size. We use such a term to stress the fact that these methods build new objects, choosing them from a class hierarchy at runtime. This is closely connected with the factory method design pattern [31].

Example 5.6 union(String, Symbol).exhaustive(1) iterates through all strings of size one: "\0", "\1", ..., "\127" and all symbols of size one: :\1, :\2, ..., :\127. It does so by calling String.exhaustive(1) and iterating through all the elements, and subsequently calling Symbol.exhaustive(1) and doing the same.

Example 5.7 product(String, String).exhaustive(1) iterates through all possible pairs of strings of length one.

As in random testing, those new iterators need to be returned from a class method, exhaustive. In contrast, there is no customization through the of method.

Generation process

Just like in random testing, generators for each parameter of the property are normally provided by the types indicated in the property's signature. We will see later on a mechanism for indicating custom generators for each property without the necessity of creating a new ad hoc type, as we already indicated previously. But, again, this is a minor detail. The important fact is that each parameter has a generator of exhaustive values attached to it, which responds to the exhaustive message by returning iterators of the indicated size.

1. All iterator factories linked to each parameter are combined using product.

2. The resulting object is requested to provide an iterator of elements of increasing sizes, starting at 0.

3. For each size, every element returned by the resulting iterator is already a test case. Only when an iterator is exhausted is the next one requested.

Progress measurement

Definition 5.16 Exhaustive testing effort
Exhaustive testing effort is the level of depth reached in the execution. That is, the maximum size of cases which were all run.

Example 5.8 In the previous test generation section we reached depth one after checking all strings of size 0 and 1, but not all of size 2, as the property was disproved. Obviously, effort does not have much relevance once a counterexample has been found.

Definition 5.17 Exhaustive testing goal
An arbitrary depth d. It follows that all cases of size s ≤ d must be generated to reach the goal.

Note how the number of elements to be iterated over may increase fast. In our previous example with ASCII strings, it is multiplied by a factor of 128 for each additional character. So, although running all strings of one size may be easily manageable, running all strings of the next size could take a lot of time. Much more radical examples are very frequent; for instance, Unicode strings, where the factor is several orders of magnitude larger.

Remark 5.18 Due to combinatorial explosion, running all cases of a certain size may take little time, whereas running all cases of the following size may be unfeasible, or downright impossible.

Historical faults testing

Motivation

Regression testing is the practice of testing a program in search of faults introduced in features that previously worked correctly, which are known as regressions, hence the name. This practice usually took place only during the maintenance phase, but with the advent of the agile methods (extreme programming and test-driven development) regressions are tested throughout the whole software life cycle. It is necessary to do so as development is performed in an iterative and incremental way.

A set of unit tests is created in parallel with the code, and run every time a change is made.

Even though rerunning all test fixtures each time can be expensive once they grow to a respectable size, common unit testing frameworks do not provide any mechanism for easing the task other than simply executing the cases that failed in the previous batch. This approach is hardly valid in our case, as test cases are implicit and have to be generated each time. Therefore, a test case that disproved a property in a previous execution, for a bug that has since been reintroduced, may take too much time to be generated again. Or even worse, it may not be generated at all within a few executions of the framework.

If we think of the previous strategies as search algorithms for finding faults in the input space, it is reasonable to think that we can use information from previous executions to speed up the search process. The reader will now understand that we need to apply a more sophisticated policy, which implies in all cases storing test cases that were found to falsify a property. Below, we propose a mechanism for scheduling the execution of those cases, such that the ones with a higher probability of finding bugs are run with a higher probability, while at the same time ensuring that all of them are eventually run. This mechanism is borrowed from regression testing techniques used to schedule unit tests in constrained environments [42], which in turn are inspired by common statistical quality control practices [38].

Note how our problem slightly differs from regression testing. We define historical faults as:

Definition 5.19 Historical fault
Any test case that was successfully used for disproving a property at some execution of the framework.

Whereas regressions, already defined implicitly at the beginning of the section, are:

Definition 5.20 Regression
Any test case that previously did not falsify a property and now does. In other words, a fault introduced in a feature that previously did not malfunction.

Maintenance techniques are interested in regressions since their major problem is to control that minor changes introduced to the software do not break the large amount of existing functionality.

Our problem is slightly different. We do care about features that do not work initially, because we assume we are in an active development phase. Of course, this has the risk of making the tool store useless information if properties are checked too early. That is, when the implementation is still too primitive, almost any arbitrary test case can falsify the corresponding property. This is why we proposed an adequate development process to avoid such scenarios in section 4.3. However, if there is a strong desire to use the tool in a "watch the tests fail" fashion, it is not hard to imagine how the strategy could be changed in order to record regressions only.

We work under the assumption that the number of historical faults may grow too big to be run in one step. Or, alternatively, that each case takes enough time to be checked that it is desirable to run the ones with a higher probability of finding bugs first. If neither assumption holds, the method is overkill, but we think that will rarely be the case. Still, it can then be applied for scheduling the verification of whole properties.

Test case generation

Each test case that falsifies a property is serialized and stored in a database. (As a design side note, it is possible to do so and still keep the framework lightweight by employing an embedded relational database management system such as SQLite [3]. These RDBMS operate as libraries, not as standalone processes, and store the whole schema in a single file.) Along with the case, we keep a set of time-ordered observations of its executions H = {h_1, h_2, ...}. For each execution of the property, if the case is used but the property is not falsified with it, h_i = 0. Otherwise, it takes the value 1. Obviously, when the case is first stored it is because it failed, so h_1 = 1 always.

Now we define the probability of running a test as a weighted moving average of its previous executions. It is weighted so that the importance of execution history can be balanced. It is a moving average in order to refresh the probability of running the case after each new execution.

  P_0 = h_1,    P_k = α · h_k + (1 − α) · P_{k−1}

Probabilities can be interpreted in their original meaning, or simply as a priority index in case we want to run all regressions each time but wish to order them in an intelligent manner. We use the second approach.

α is a smoothing constant. With values close to 1, recent faults and test cases that have not been retested for a while quickly become the ones with the highest priority. As a consequence, we cycle through all cases in a few executions.

Conversely, if α is close to 0, the pace is slower, and it takes longer to cycle through all historical faults. α should be set according to how fast we change code and how frequently we run the framework [42]. In order to avoid making such a decision, a fixed intermediate value of 0.6 should fit all usages reasonably well.

We should note that it is unnecessary to store all observations h_k and all probabilities P_k. According to the equation above, we recompute the current probability by using the current observation and the previous probability. Thus, we can overwrite the previous probability with the new one after each run.

Generation process

The relation stored in the database is a 3-tuple: (property, case, probability). The generation process is very simple:

1. Select all cases for the current property, and order them by probability, higher probabilities first.

2. Return the next test case of the collection.

Progress measurement

Progress measurement is also conceptually straightforward.

Definition 5.21 Historical faults testing effort
The number of historical faults already generated for the current property.

Definition 5.22 Historical faults testing goal
A number of test cases to be generated for the current property. This number is calculated by multiplying a given arbitrary percentage by the number of cases stored in the database that correspond to the current property.

Human generated test cases

Motivation

As already outlined, humans have different abilities than machines. Therefore, they are in a good position to generate test cases that may have a very high probability of detecting faults. They can do so by applying different techniques.

First, they can use black-box techniques, such as the ones described in chapter 2. Boundary value testing is a good example of a technique humans can easily employ, whereas machines cannot, as it involves constraint solving over all possible types of a program, still a research problem.

Second, they can also use white-box techniques, such as loop testing, for which humans are again very well suited. It is worth pointing out that promising approaches for generating test cases by using program analysis techniques have recently been proposed [32]. We left them out since they were far beyond the scope of the thesis.

Third, they can also apply what are known as grey-box techniques, which describe what most developers do in the end. That is, having a look at the internal structure of the program, but designing the test cases at a functional level.

Test case generation

The philosophy of the framework is to automate unit testing as much as possible. Hence, this facility is provided with the intention that only a few test cases with a clear purpose are provided by humans. Correspondingly, there are no features for structuring those cases. Nonetheless, as they are fed to the tool in a list, it is possible to use the facilities of the underlying language to mechanically generate an arbitrarily large collection of them.

Example 5.9 Property reverse, presented in example 5.1, is rewritten below to provide some human-generated test cases.

  property :reverse => [String, String] do
    predicate do |a, b|
      (a + b).reverse == b.reverse + a.reverse
    end

    always_check ['', ''], ['x', ''], ['', '1']
  end

We can see that the property has been refactored and now its body contains not only the predicate but also a list of strings provided by the user. As they are meant to be run every time the framework checks the property, the syntax always_check has been chosen.

The exact structure of the domain-specific language used to specify properties should not be a major concern until later, but it is worth pointing out that properties need not be rewritten to add human test cases. Since properties are first-class elements in the underlying language, a simple method call would do the same job. The form shown above is just syntactic sugar. Nonetheless, we recommend it is followed, as we believe that keeping the property and the human-generated cases together is advisable.

Remark 5.23 Invoked inside a property block or called on a property object, the method always_check takes as argument a list of lists, each of size equal to the arity of the property. It ensures that the cases indicated in each list are run every time the property is checked.

For the sake of clarity, it is worth mentioning that always_check is neither a combinator nor a generator. It is just a method of the class Property that stores the provided cases inside each instance. It is the strategy itself which iterates through all of them.

Generation process

From the framework side, the generation step is trivial: just pick the different cases provided by the user.

1. Select the next case indicated by always_check not previously chosen.

2. Return this case.

Progress measurement

Progress measurement is trivial too, and just consists in counting the number of supplied test cases run in comparison with the total.

Definition 5.24 Human generated cases testing effort
The number of provided cases already generated (returned by the strategy).

Definition 5.25 Human generated cases testing goal
A prefixed number of cases less than or equal to the amount supplied. Usually equal.

Genetic testing

Motivation

The test case generation strategies presented so far are aimed at different classes of bugs, but none of them takes into account the internal structure of the property to assist the generation process. Instead, information comes from default or user-defined generators in random and exhaustive testing, from old faults in historical testing, and directly from developers in the case of human-generated test cases. The approach is to keep on generating cases with those strategies hoping to reach the goal or adequacy criterion at some point. A reasonable one, as we will see in the next section, is to strive for property coverage.

Dynamic test case generation strategies collect information about cases as they are run in order to refine them and try to achieve a certain goal [51], as we outlined in chapter 3. Thus, by employing one of those, we are no longer blindly generating cases expecting to ultimately achieve our desired goal. Instead, the strategy can actively try to pursue it.

Among dynamic test case generation strategies, we have chosen the evolutive approach, and in particular genetic programming [6], since it is very flexible and general. An interesting point is that genetic programming has usually been employed for generating white-box test cases aiming at condition coverage. Nonetheless, we use it for functional or black-box ones. The conversion step is immediate, since what is valid for condition coverage of common code is also valid for properties. Recall that properties are functions whose body is a single Boolean expression, and that we are using executable code for representing them.

Test case generation

We will not provide an in-depth explanation of genetic algorithms here. For a gentle introduction we advise the reader to have a look at any of the classical references, such as [62]. The main idea is that genetic algorithms provide a very general mechanism for calculating approximate solutions to optimization problems over non-linear functions. The question then is how to transform a test case generation problem into an optimization one.

The solution is quite simple: provide a fitness function that measures how far the current test case is from the goal. Goals are Boolean values for each subexpression of the property. For instance, let us use the following property to illustrate this:

  property :for_genetic => [Integer, Integer] do
    predicate do |a, b|
      a >= 5 && b <= 3
    end
  end

Clearly this is an unsound property, as there are values for which the predicate does not hold. However, our purpose is to show how genetic programming search works for properties. The same ideas are also valid for arbitrarily complex properties.

A possible goal could be to make the property true, making both subexpressions true. Note that we could also try to make the property false. Both alternatives are interesting: the first one for achieving property coverage, which we will explain in the next section; the second for disproving properties, which is the ultimate goal of the framework.

Next, a fitness function has to be defined. The fitness function depends on the goal, and returns a value indicating how far we are from it. The algorithm tries to minimize the outcome of this function. In this case, we could define the fitness function for the first subexpression as 5 − a if a < 5, and 0 otherwise. For the second one, b − 3 if b > 3, and 0 otherwise. The fitness function for the whole property would be the sum of the fitness functions of the subexpressions.

Fitness functions are built following the same schema for different types of subexpressions, so the task can be automated. For instance, when the desired outcome is true, numeric expressions of the form a ≤ b are turned into a − b if a > b, and 0 otherwise.

Finally, we need to determine how to encode the inputs to the property, and how they are reproduced and mutated. In the case of integers, the problem is trivial. Inputs are tuples of integers. Reproduction or crossover involves combining two possible inputs into a new one; in this case that is done by joining the first element of one tuple with the second element of the other. Mutation implies performing little changes to a solution. For numbers, a simple approach is to add or subtract small random quantities to the elements of the tuple.

All this has its purpose, admittedly a bit unclear now. We try to provide some clarifications subsequently.
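The following Ruby sketch makes the fitness construction concrete for the for_genetic property above, using the standard branch-distance scheme; the helper name and constants are illustrative, not part of the framework.

  # Branch distance for each subexpression: 0 when it already holds,
  # otherwise how far its operands are from making it hold.
  def fitness(a, b)
    d1 = a >= 5 ? 0 : 5 - a   # distance for a >= 5
    d2 = b <= 3 ? 0 : b - 3   # distance for b <= 3
    d1 + d2                   # 0 means the whole predicate evaluates to true
  end

  fitness(7, 1)   # => 0, the goal has been reached
  fitness(2, 10)  # => 3 + 7 = 10, still far from the goal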

Generation process

1. An initial population, or set of inputs to the property, is built. If these inputs are not encoded in any custom way, the random generators of random testing can be reused. The size of the population is a parameter that influences the performance of the genetic algorithm [62].

2. Among the elements of the population, those that are more apt according to the fitness function are selected. Some less apt elements are also chosen in order to prevent premature convergence.

3. A second generation of solutions is bred by using the crossover function. These solutions are expected to share many good features of their parents. In order to prevent the set of solutions from becoming too homogeneous, some mutations are applied.

4. Steps 2 and 3 are repeated a fixed number of times.

It should be noted that the genetic strategy is different from the rest in the sense that the generation process is tightly coupled to the execution of test cases. New cases cannot be created without running the previous ones. This is due to the fact that we need to know the outcome of the fitness function, which can only be obtained by running the property.

Progress measurement

Progress measurement is also different in this strategy, thanks to the fact that it does not blindly generate test cases. A reasonable way of running the genetic strategy is with the goal of achieving property coverage. In short, that means evaluating each Boolean subexpression to all the possible outcomes that make the whole property true. We can then define:

Definition 5.26 Genetic testing effort
The number of outcomes of the different subexpressions required for property coverage that are already satisfied by the generated test cases.

Definition 5.27 Genetic testing goal
The total number of outcomes of the different subexpressions required for property coverage.

In other words, the testing goal is the sum of the sizes of the sets in the right column of table 5.2. The effort is the number of those entries satisfied by the generated test cases.
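A toy version of the loop just outlined, reusing the fitness sketch above, could look as follows in Ruby. Population size, mutation width and the number of generations are arbitrary choices, and real implementations use more careful selection schemes.

  def evolve(generations = 50, pop_size = 20)
    # Step 1: initial population of random integer pairs.
    population = Array.new(pop_size) { [rand(-100..100), rand(-100..100)] }
    generations.times do
      # Step 2: keep the fitter half of the population.
      parents = population.sort_by { |a, b| fitness(a, b) }.first(pop_size / 2)
      # Step 3: crossover swaps tuple halves, then small mutations are applied.
      children = parents.each_slice(2).flat_map do |pair|
        p1, p2 = pair
        p2 ||= p1
        [[p1[0], p2[1]], [p2[0], p1[1]]]
      end
      children.map! { |a, b| [a + rand(-2..2), b + rand(-2..2)] }
      population = parents + children
    end
    population.min_by { |a, b| fitness(a, b) }
  end

  evolve  # => a pair such as [7, 1], for which the predicate holds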

5.2 Independent Progress Measurement

So far we have seen that each test generation strategy offers a different progress measurement mechanism, tightly coupled to it. These ways of accounting for progress are valid in frameworks with a single strategy for generating test cases. However, when using several strategies, an independent and general facility is needed, as we already outlined in the previous chapter.

On the one hand, it is not unlikely that, in the process of trying to disprove a property, one or more strategies take longer than expected to achieve a sufficient degree of progress according to their own way of measuring it. Hence, having one neutral way of determining whether enough test cases have already been evaluated is more than desirable. This would allow moving forward to another property even though some strategy has not achieved its goals based on its own progress measurements.

On the other hand, in the scenario where all strategies finish promptly, one would like to know, from a strategy-independent point of view, whether testing has been sufficient, or more test cases need to be generated even though the individual strategies think enough effort has been put into trying to disprove the current property.

Let us introduce, step by step, the different concepts that lead to a good way of measuring progress from a black-box point of view. That is, by solely using the specification, which means that we will not examine the code being run. This excludes white-box techniques like code coverage. As explained in chapter 2, white-box techniques complement black-box ones. However, they were left out of the scope of this thesis.

Definition 5.28 Boolean expression
A Boolean expression is an expression that always evaluates to a Boolean value, that is, true or false.

Definition 5.29 Boolean expression coverage
A Boolean expression E is covered by a set of environments if either:

E has been evaluated both to true and to false, and all its Boolean subexpressions (if any) have also been covered; or

E is a Boolean literal, that is, E ∈ {true, false}.

As we will soon see, Boolean expression coverage is directly related to properties. Since the latter were defined as functions of any arity returning Boolean values, and having a single expression as their body, the relationship between the two is easy to determine. It is worth pointing out that the previous definition is just a formalization of condition coverage for a single composite Boolean expression. This contrasts with using all the Boolean expressions of a program to measure testing progress, as is done in some white-box testing approaches mentioned in chapter 2 [16].

Example 5.10 The expression a ≥ 2 is covered if evaluated in the environments {(a = 1), (a = 3)}. Conversely, it is not covered if evaluated just for {(a = 2)}.

Example 5.11 The expression a ≥ 2 ∧ b ≤ 3 is not covered if evaluated in the environments {(a = 1, b = 1), (a = 3, b = 4)}. Although the two comparisons have each evaluated both to true and to false, the whole expression has only evaluated to false.

Definition 5.30 For each (sub)expression E, let R be a binary relation over environments such that two environments are related iff E evaluates to the same value in both environments. R is an equivalence relation.

Lemma 5.31 An expression E is covered iff it has been evaluated in sufficiently many environments that there is at least one environment belonging to each of the equivalence classes defined by the Boolean subexpressions of E with respect to the previous relation R.

Proof. Showing that the set of environments the expression needs to be evaluated in so that it is covered is the same as the set needed to have values in all equivalence classes is enough to prove the double implication. By the definition of R, the environments that make a Boolean expression true and those that make it false are exactly its two equivalence classes.

Example 5.12 Figure 5.1 depicts the equivalence classes of the expression used in the previous example. Each subexpression generates two equivalence classes, one for the values (environments) that make it true and the other for those that make it false. Since the top expression is a logical and, two additional classes are created: the intersection of the classes that make each subexpression true, and the rest. So a ≥ 2 generates two equivalence classes: the environments on and to the right-hand side of the vertical line starting at 2, and those to the left. b ≤ 3 generates another two equivalence classes: the environments on and below the horizontal line starting at 3, and the rest.

Figure 5.1: Equivalence classes generated by the expression of example 5.11

Note that even though there are values in all the equivalence classes of the subexpressions, there are none in one of the classes of the topmost expression. In other words, the whole expression never evaluates to true. As a result, the expression is not covered.

Definition 5.32 Property expression coverage
A property expression is covered iff all its Boolean subexpressions have evaluated to all the possible values that make the whole expression true.

Remark 5.33 Covering an expression that belongs to a property involves using a subset of the environments needed for covering a Boolean expression as per definition 5.29. This is because property expression coverage is less strict, and only requires the expression to have been evaluated in environments that make it true.

Example 5.13 If the expression a ≥ 2 ∧ b ≤ 3 is the body of a property, it is covered by just evaluating it in the environment {(a = 2, b = 2)}.

Corollary 5.34 By lemma 5.31 and definition 5.32, a property expression is covered if it evaluates to values belonging to all the possible equivalence classes defined by its subexpressions that still make the whole expression true.

The link between property coverage and equivalence partitioning is quite evident, as we have shown, yet it is completely absent from the literature. There is a reason for this. Equivalence partitions are used as a black-box testing technique to divide the inputs of a function into classes of elements as a way to avoid exhaustive testing. Trying one element from each class is considered enough.

On the other side, condition coverage is a white-box testing technique mainly applied to imperative code. As conditions can rarely be related to inputs in such code in a way that is meaningful for black-box testing, the relationship between the two has never been established. However, in our case it serves well the purpose of showing that what intuitively seems to be a good way of ensuring that expressions are tested to some degree of exhaustiveness is, in reality, equivalence partitioning, an acknowledged technique for functional testing.

We still need to show a mechanism for calculating the values each subexpression needs to take in order to achieve property coverage, as the definition we provide is not constructive. Algorithm 1 computes a map whose keys are expressions and whose values are subsets of {true, false}. That is, it returns the Boolean values each expression should evaluate to in order to get the property covered.

Initially it is called with the whole expression as parameter. Its second parameter is set to {true} by default. This is because the topmost expression is required to evaluate only to true. Otherwise, we would be requiring the property to be falsified in order to be covered, which is clearly contradictory.

The algorithm creates a new map with the expression mapped to the provided values. Subsequently it calls the auxiliary getvalues function, passing the expression's operator and the provided values. This function is not described for the sake of brevity. It just looks up the provided operator in table 5.1 and returns the union of the sets of the third column that correspond to the provided values. These are the values the immediate subexpressions can take. Finally, the algorithm calls itself recursively once for each subexpression and returns a map made of the union of all results plus the one built initially.

Algorithm 1 Property coverage table

function pctable(expression, values)
  exprtable ← [expression → values]
  subvalues ← getvalues(operator, values)
  subtable ← [ ]
  for all subexpression in expression do
    subtable ← subtable ∪ pctable(subexpression, subvalues)
  end for
  return exprtable ∪ subtable
end function
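A direct transcription of Algorithm 1 into Ruby could look as follows. The expression representation (nested arrays tagged with an operator symbol, leaves as plain symbols) and the SUBVALUES table standing in for getvalues are assumptions made for the sketch, not the framework's actual data structures.

  # Allowed subexpression values per operator and required value (cf. table 5.1 below).
  SUBVALUES = {
    and:    { true => [true],        false => [true, false] },
    or:     { true => [true, false], false => [false] },
    not:    { true => [false],       false => [true] },
    forall: { true => [true],        false => [true, false] },
    exists: { true => [true, false], false => [false] }
  }

  # Expressions: [:and, e1, e2], [:not, e], [:exists, e], ... or a leaf symbol.
  def pc_table(expr, values = [true])
    table = { expr => values }
    return table unless expr.is_a?(Array)   # leaves have no subexpressions
    operator, *subexpressions = expr
    subvalues = values.flat_map { |v| SUBVALUES[operator][v] }.uniq
    subexpressions.each { |sub| table.merge!(pc_table(sub, subvalues)) }
    table
  end

  pc_table([:not, [:exists, [:and, :p_x, :q_x]]])
  # => { [:not, ...]    => [true],
  #      [:exists, ...] => [false],
  #      [:and, ...]    => [false],
  #      :p_x           => [true, false],
  #      :q_x           => [true, false] }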

Operator | Value | Subexpressions
∧        | true  | {true}
∧        | false | {true, false}
∨        | true  | {true, false}
∨        | false | {false}
¬        | true  | {false}
¬        | false | {true}
∀        | true  | {true}
∀        | false | {true, false}
∃        | true  | {true, false}
∃        | false | {false}

Table 5.1: Auxiliary table for the getvalues function

Example 5.14 Computing the property coverage table for a property whose expression is ¬ ∃x · p(x) ∧ q(x).

The algorithm is initially called with the whole expression and {true} as parameters, since we want to compute all the possible values its subexpressions can take while still making the whole expression true. As per the second step, the whole expression is mapped to {true}. Next, the negation operator, along with the {true} set, is passed to the auxiliary getvalues function. This collects all the values that the expression inside the negation can take in order to make the negation true. The result is {false}.

Subsequently, the algorithm calls itself recursively as many times as there are expressions inside the current one. In this case, only once, as the negation contains a single existential quantification. The process is repeated until we reach a leaf Boolean expression, that is, an expression that is not composed of other Boolean expressions. In our case these are the p(x) and q(x) functions. Note that if those were properties, we could proceed further using their Boolean expressions in the algorithm, effectively calculating coverage for the resulting composed property.

The result is easy to understand. In order to make the whole expression true, the existential quantification can only evaluate to false. Therefore, the conjunction must always evaluate to false too. Otherwise, it would make the existential expression true.

Expression         | Values
¬ ∃x · p(x) ∧ q(x) | {true}
∃x · p(x) ∧ q(x)   | {false}
p(x) ∧ q(x)        | {false}
p(x)               | {true, false}
q(x)               | {true, false}

Table 5.2: Outcome of the algorithm after running the example

Finally, each component of the conjunction can evaluate to either true or false, since neither value by itself forces the conjunction to become true.

The resulting mapping of expressions to sets of Boolean values can be applied straightforwardly to measure testing progress. Each time the property is evaluated with a test case, the values of each subexpression are taken and removed from the corresponding entry of the above table. When the right column becomes empty, we have achieved the desired property coverage.
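Continuing the sketch from Algorithm 1 above, tracking property coverage then amounts to deleting observed outcomes from the computed table; the record helper below is illustrative only.

  remaining = pc_table([:not, [:exists, [:and, :p_x, :q_x]]])

  # observed maps each subexpression to the Boolean value it took for one test case.
  def record(remaining, observed)
    observed.each do |expr, value|
      next unless remaining.key?(expr)
      remaining[expr] = remaining[expr] - [value]
    end
    remaining.values.all?(&:empty?)   # true once property coverage has been achieved
  end

  record(remaining, { :p_x => true, :q_x => false,
                      [:and, :p_x, :q_x] => false,
                      [:exists, [:and, :p_x, :q_x]] => false,
                      [:not, [:exists, [:and, :p_x, :q_x]]] => true })
  # => false, since p(x) and q(x) still need their other outcome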

Chapter 6

Testing Object-Oriented Systems

Entia non sunt multiplicanda præter necessitatem.
Entities should not be multiplied beyond necessity.
Ockham's razor, after William of Ockham

This chapter describes how to apply the elements already presented to support object-oriented systems. This kind of software has certain properties that make it inherently more difficult to test. We summarize them in section 6.1.

Our approach unifies contracts, as in Design by Contract [49, 50], with properties, so that the former can be rewritten in terms of the latter. We formulate contracts as a special kind of property whose main peculiarity is that it ranges over just one function or method, to which it adds runtime safety checks. By not departing from properties, we avoid introducing additional concepts into the system. This has the enormous advantage of being able to apply all the previous ideas to object-oriented software specification without any changes. Furthermore, we allow contracts to include assertions about the messages sent by each method, something not found in common Design by Contract implementations. These ideas are exposed in section 6.2.

6.1 Complexity of Object-Oriented Systems

Although we think that object-oriented software is very valuable for solving problems in certain domains, we strongly believe that this paradigm is often used in places where functional programming would be much more adequate. This unsuitability stems from its poor engineering properties, which become evident when trying to test it. Thus, it is worth understanding the issues that arise before attempting to extend the framework to support such systems.

Computation model

The object-oriented paradigm has perpetuated many intrinsic defects from procedural programming, which in turn inherited them from the von Neumann computer. In short, its semantics is closely coupled to state transitions, its constructs are artificially divided between expressions and statements, and it lacks mathematical properties for reasoning about its programs. These deficiencies come from the underlying computing model, the von Neumann architecture, which is embodied by hardware instead of being a pure abstraction like the lambda calculus.

Backus characterized well the three main models behind most computing systems [9]: simple operational models, like Turing machines or automata; applicative models, like the lambda calculus or functional programming systems; and von Neumann models, like conventional programming languages, including object-oriented ones. The latter were labelled as having complex and bulky foundations, being history sensitive and having a semantics based on transitions with complex states. All this implies that object-oriented programs are not very clear, and thus difficult to test.

Engineering properties

From a less abstract point of view, we can have a very brief look at some of the engineering properties of object-oriented systems. These are generally acknowledged to possess the following attributes: abstraction, encapsulation, modularity and hierarchy [19]. Let us focus on the last three. Encapsulation (also known as information hiding) denies external collaborators access to the private members of a class. Modularity breaks functionality into separate components [58]. However, both are severely compromised by hierarchy, i.e. inheritance [23]. Besides, referential transparency, the capacity to replace expressions by their values [35], is rarely possible due to the history sensitivity and identity of objects.

6.2 Contracts as Properties

Design by Contract

Design by Contract was introduced in the Eiffel programming language as a methodology for object-oriented software construction [50]. Its main goal is to increase the reliability of object-oriented systems, considered to be even more important than in procedural code due to their focus on reuse. Not surprisingly, the core of Design by Contract is the notion of contract, a systematic approach to specifying how to deal with abnormal cases. Contracts are usually explained by means of an analogy to business agreements [49], as we will see immediately.

Preconditions and postconditions

The caller of a certain method, like the client party in a business contract, has to fulfill some obligations. These are specified by the precondition section in the method's contract. The provider of a certain method, like the supplier party in a business contract, also has to fulfill some obligations. These are specified by the postcondition section in the method's contract. Obligations for one party are benefits for the other, as they ensure it will not have to deal with exceptional cases.

For example, let us have a look at the typical contract of the square root function sqrt as it is implemented in most object-oriented languages.

Party    | Obligation                        | Benefit
Caller   | Pass a non-negative argument      | Get the square root of the argument calculated
Provider | Return n such that n^2 = argument | No need to deal with imaginary numbers

Table 6.1: sqrt function contract

Contracts, just like in the business world, protect both sides by clearly stating how much should be done, and how little is acceptable.
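A plain-Ruby rendering of this contract, using explicit assertions rather than the framework's contract syntax (introduced later in this chapter), might look as follows; checked_sqrt and the tolerance are illustrative choices.

  TOLERANCE = 1e-9

  def checked_sqrt(x)
    # Precondition: the caller's obligation.
    raise ArgumentError, 'sqrt requires a non-negative argument' if x < 0
    result = Math.sqrt(x)
    # Postcondition: the provider's obligation.
    unless (result * result - x).abs <= TOLERANCE * (x + 1)
      raise 'postcondition violated: result squared must equal the argument'
    end
    result
  end

  checked_sqrt(2.0)   # => 1.4142...
  checked_sqrt(-1.0)  # raises ArgumentError; the square root is never computed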

There are two major aspects that should be stressed. First, contracts have no hidden rules. Both parties know exactly what they are expected to provide and receive. Second, and more importantly, contracts have fail-fast behaviour. If a precondition is not satisfied, then the method is not called, and the caller is notified. Conversely, if a postcondition is not satisfied, the method does not return and the provider is informed. Failing fast ensures that the parties do not have to bother dealing with abnormal values, possibly with unpredictable consequences, as they may not be fully prepared for them. This is why contracts have to be implemented using assertion facilities that verify pre- and postconditions, and break the normal control flow, by raising exceptions or the equivalent in the implementation language, when they are not satisfied.

Class invariants

Pre- and postconditions can be applied to non-object-oriented software, that is, to functions or procedures. Nothing makes them dependent on objects. In fact, they are very well suited for specifying functional code, although there is not much tradition of doing this. For example, the sqrt function presented previously was not a method.

However, when applying them systematically to specify object-oriented code, one discovers that part of the postcondition predicates is always repeated. For instance, linked list implementations usually maintain a counter of the number of elements currently in the list for the sake of efficiency. So, a linked list implementation would always verify that this counter has not become inconsistent with the state of the object. As we can see, there is a part of postconditions that transcends them. It is called the class invariant. Invariants ensure that the state of the class is perpetually consistent. Otherwise, the fail-fast principle of Design by Contract comes into play, preventing further damage.

The class invariant is checked when the construction of an object has finished, and after the execution of every public method. Furthermore, invariants are usually also checked before the execution of every public method, to avoid running as much code as possible in case the class has reached an inconsistent state. This could happen if the contract assertions have been somehow bypassed.
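As a rough illustration of when the invariant is checked, the following Ruby sketch wraps every public instance method of a class with an invariant check on entry and exit. Construction-time checking is omitted, and the InvariantCheck module is hypothetical rather than the framework's mechanism.

  module InvariantCheck
    # Wrap each public instance method of klass so that the invariant block is
    # evaluated (in the instance's context) before and after the call.
    def self.wrap(klass, &invariant)
      klass.instance_methods(false).each do |name|
        original = klass.instance_method(name)
        klass.define_method(name) do |*args, &blk|
          raise 'class invariant violated on entry' unless instance_exec(&invariant)
          result = original.bind(self).call(*args, &blk)
          raise 'class invariant violated on exit' unless instance_exec(&invariant)
          result
        end
      end
    end
  end

  # Hypothetical usage for the linked list example:
  # InvariantCheck.wrap(LinkedList) { @size == count_nodes }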

Contracts and inheritance

We are still missing an important aspect before being able to formulate a definition of preconditions, postconditions and class invariants: we have not yet shown how inheritance affects these constructs. This is well summarized by the general rule in object-oriented software known as the Liskov substitution principle [45, 47]:

"What is wanted here is something like the following substitution property: If for each object o1 of type S there is an object o2 of type T such that for all programs P defined in terms of T, the behavior of P is unchanged when o1 is substituted for o2, then S is a subtype of T."

More informally, this means that we should always be able to use a subtype (an instance of a subclass) in place of any of its supertypes (superclasses), and the system should operate correctly. Therefore, if we want to use a subclass in place of a superclass, each of the redefined methods should have an equally or less restrictive precondition and an equally or more restrictive postcondition. To see why, we recall a later and more refined formulation of the Liskov substitution principle [48]:

"Let q(x) be a property provable about objects x of type T. Then q(y) should be true for objects y of type S where S is a subtype of T."

Let us denote by q(x) the property of successfully calling method m with parameters p. If this property is provable for objects x of type T, then it should also hold for objects y of type S, S being a subtype of T, according to the principle stated above. However, if we make the precondition of method m stronger, it may reject being called with parameters p. Conversely, we could also build another property to demonstrate that a less restrictive postcondition may break the caller of method m, as it would allow returning a superset of the expected values.

Class invariants are also affected by inheritance, although for more mundane reasons. Since attributes are expected to be hidden from clients, the Liskov substitution principle does not apply here. Instead, it is subclasses which break information hiding. As they may have access to the internal state inherited from the superclass, they can potentially change it, violating the invariant of the superclass, which could break inherited but not redefined methods. Hence, invariants can only be made more strict.
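A small, hypothetical example (not from the thesis) makes the precondition rule concrete:

  class Account
    # Implicit precondition: amount > 0.
    def withdraw(amount)
      raise ArgumentError, 'amount must be positive' unless amount > 0
      # ...
    end
  end

  class RestrictedAccount < Account
    # Strengthened precondition: amount > 0 and amount <= 100.
    def withdraw(amount)
      raise ArgumentError, 'amount out of range' unless amount > 0 && amount <= 100
      # ...
    end
  end

  # A caller written against Account may legitimately call withdraw(500);
  # substituting a RestrictedAccount breaks that caller, violating the principle.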

Design by Contract definitions

After this, we are in a good position to give formal meaning to the three elements of Design by Contract.

Definition 6.1 Precondition. A precondition is a predicate defined for a particular method of a class. It ranges over the parameters of the method and the class attributes. The predicate, in disjunction with the preconditions of the same method defined in the superclasses, should always hold before running the body of the method.

Definition 6.2 Postcondition. A postcondition is a predicate defined for a particular method of a class. It ranges over the parameters of the method, the return value and the class attributes. The predicate, in conjunction with the postconditions of the same method defined in the superclasses, should always hold after running the body of the method.

Definition 6.3 Class invariant. A class invariant is a predicate defined for a particular class. It ranges over the attributes of that class. The predicate, in conjunction with the invariants of the superclasses, should always hold after the construction of class instances and after running every public method on them.

Remark 6.4 In a non-object-oriented environment, references to superclasses should be ignored and the class invariant definition disregarded altogether. The rest applies, as explained previously.

Contracts as specialized properties

Contracts look a lot like properties. Both are predicates (Boolean-valued expressions) used to specify software. Like properties, contracts are executable and almost always written in the host language. Unlike properties, contracts are meant to define the behaviour of just one method at runtime. Therefore, their arity is fixed and they normally cannot include calls to other methods, in order to avoid circular references between contracts. Moreover, they change the semantics of the function or method specified, breaking normal control flow when the contract is not satisfied.

Intuitively, the differences do not lie in the essential parts, as both constructs are used for specifying software through runnable predicates. Hence, it seems very reasonable to attempt to unify both. The benefits of this approach are twofold. On the one hand, we can use contracts for generating test cases at development time, in the same way we employ properties. On the other hand, we can still use them for securing the software at runtime.

A major design aspect is to avoid introducing a new entity in the system, and thus having to deal with it differently from properties. In object-oriented terms, this implies modelling Contract as a subtype (subclass) of Property and adhering to the Liskov substitution principle we quoted previously for completely different reasons. That is, it should be possible to use Contract instances in place of Property instances without noticing. In practice, this means that all strategies previously presented can be applied to contracts with no changes at all.

From a language designer's point of view, we would like to follow what is known as the kernel language approach [60]. We have introduced a kernel language for writing properties, and we wish to introduce contracts without the need for new basic constructs. Let us illustrate how with a basic case.

Example 6.1 Simple class invariant and contract.

    class Counter
      invariant { count >= low }

      # ...

      contract :add => Integer do
        requires { |i| count + i >= low }
        ensures  { |i, r| count == r && r == old.count + i }
      end

      def add(i)
        # ...
      end
    end

The Counter class creates instances that maintain an internal counting attribute called count, which for some reason has to be greater than or equal to a lower bound low defined for each instance when it is built. This fact is captured by the class invariant. The method add increments the internal counter by the provided amount i. We do not care how this is done, which is why the method implementation is omitted. All we care about is that we have to provide an amount i that, when added to the internal counter, still keeps it equal to or greater than low, in order to satisfy the class invariant. This is specified by the precondition of the add contract, which is labelled requires, after Eiffel's syntax [50]. The postcondition, labelled ensures (again following Eiffel's syntax), guarantees that upon the execution of add the count attribute will have been incremented by i and that this value will be returned. Note that we need to refer to the state of the object prior to the execution of the method in order to formulate this predicate. This can be done through the old method.

Remark 6.5 The arity of a contract is the arity of the method being specified. Therefore, its signature should contain as many types.

Remark 6.6 The arity of a precondition is the same as the arity of the contract.

Remark 6.7 The arity of a postcondition is the arity of the method specified plus one, for the return value.

Remark 6.8 In the postcondition, the state of the object prior to the execution of the method can be accessed through the method old.

First, let us concentrate on how the contract changes the semantics of the method add. This is defined by the pseudocode shown below. The precondition and the class invariant are checked before the execution of the method; if they do not hold, an exception is thrown. Next, the original method body is run. Finally, the postcondition and the invariant are verified. If they hold, the value to which the method body evaluated is returned; if not, an exception is thrown. Note how exceptions are used for implementing the fail-fast behaviour of Design by Contract, as we anticipated. Also note that the original method has to be somehow replaced in order to inject the contract assertions. This can be done either with meta-programming or with aspect-oriented techniques, as we will see later on.

    def add(i)
      unless count + i >= low                     # precondition (the class invariant is checked here too)
        raise PreconditionException.new
      end
      r = ...                                     # run the original method body
      unless count == r && r == old.count + i     # postcondition (plus the invariant)
        raise PostconditionException.new
      end
      return r
    end

    Listing 6.1: Pseudocode for contract checking

Now, let us think about which property each contract describes. In order to define a property, we need to provide a predicate of a certain arity along with a list of input types. The difficult point is to realize that the arity should be that of the method plus one. This is due to the fact that we need an object on which to call the method inside the predicate. Once we have realized that, the predicate is not difficult to write down. We just need to evaluate the parameters to the method (which are the parameters to the property except for one) in the scope of the object. This little detail is crucial. It makes it possible to reuse the same predicate for defining runtime contracts and properties. Moreover, it makes it possible to access private members of the class. This is done in the pseudocode that follows using instance_exec, which evaluates a given block in the environment of an object.

If the precondition holds, we just need to call the method. Note how we need to evaluate the postcondition in order to obtain the property coverage information. We protect the whole code against the exceptions thrown by a postcondition violation: in case the exception is thrown, we gracefully return false. Since we run properties, contracts are activated and vigilant; development time is also runtime. This has the advantage of automatically composing properties with those of the collaborators of the object. Exceptions will be thrown as well when their contracts do not hold, even if we are checking properties and not running real code.

Remark 6.9 It is important to keep contracts turned on during property checking to detect faults in collaborators of the object.

Checking preconditions is necessary to avoid running the method without the caller's part of the contract being satisfied. We could capture the exceptions thrown by preconditions and distinguish them from the rest, but this is somewhat ugly and inefficient.

    property :'Counter#add' => [Counter, Integer] do |c, i|
      begin
        (c.instance_exec(i) { |i| count + i >= low }).implies(
          c.instance_exec(i, c.add(i)) do |i, r|
            count == r && r == old.count + i
          end)
      rescue PostconditionException
        false
      end
    end

    Listing 6.2: Pseudocode for translation into property

With this method we lose the property coverage information of contracts run indirectly, that is, those belonging to methods called by the method of the contract under test. As an extension to our approach, we could perhaps establish two modes, forcing the evaluation of those properties and recording their coverage information when we know we are checking and not running real code.

Pitfalls

With the approach just exposed, objects are interpreted as state machines whose methods return values and alter the object state, depending on the parameters received and the previous state. This is captured by the corresponding contracts. So, apparently, object-oriented code can be specified and tested as easily as functional code. Far from it. There are some inherent difficulties that stem from the computational model underlying object-oriented software and the poor engineering qualities we saw in section 6.1.

Side effects

Object-oriented code relies on side effects for everything. In contrast, functional programming is mostly side-effect free, except for certain operations like I/O. Nonetheless, even those operations are usually well isolated, either manually or by using constructs like monads [53].
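The pseudocode above assumes that Booleans respond to an implies method, which is not part of Ruby's standard library. A minimal sketch of how such a helper could be provided, assuming it is acceptable to reopen TrueClass and FalseClass, is the following.

    # Minimal sketch of logical implication on Booleans:
    # p implies q is false only when p is true and q is false.
    class TrueClass
      def implies(other)
        !!other
      end
    end

    class FalseClass
      def implies(_other)
        true   # a false premise implies anything
      end
    end

    puts true.implies(false)   # => false
    puts false.implies(true)   # => true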

Hence, it is easy to unit test functions, as the output depends only on the inputs and there are no side effects derived from calling them. Specifications are simple. Of course, it is possible to get them wrong, but you cannot forget any element. Objects make it much harder. The output of a certain method depends not only on the input but also on the previous state of the object and its collaborators, as said previously. Furthermore, the side effects of the method can include changing the object state and other objects through sent messages. Including those elements in the specification is only up to the discipline of the programmer.

Messages

Note how we said the side effects of a method call include not only changing the state of the object, but also changing the state of its collaborators. For instance, let us imagine that the previous Counter class is used by some arbitrary class CounterClient that delegates the responsibility of counting to it:

    class CounterClient
      def increment
        # ...
      end
    end

Writing the contract for increment poses an interesting question. As far as this method is concerned, there are no preconditions. We should care about not sending any values to counter that could underflow it, but clearly this cannot happen given the use we make of it. What about the postcondition? Should we make sure that the counter has been incremented? No. The Counter class has a contract which ensures that it will be correctly updated upon the call of add. Therefore, duplicating this specification would be an error. Design by Contract tools would not go beyond that, although we argue this is a mistake. The specification should definitely contain, as a postcondition, that a message to the Counter instance has been sent, as this affects the state of the program. Then, the Counter contract will do the rest.

    postcondition { |r| must_send(:add, counter, [1]) }

We provide a facility that intercepts and stores all messages sent by each method, so that contract predicates can be formulated on them.

Remark 6.10 Through must_send, postconditions can make assertions about the messages that were sent by the method during its execution.

Message passing is an essential feature of object-oriented systems, yet Design by Contract tools do not provide facilities for writing assertions on messages. This makes it hard to specify object-oriented code in some scenarios. Sometimes one has to choose between writing well-separated but incomplete specifications, or complete but entangled ones. The former would mean omitting any mention of the side effects on the collaborators (like counter in the previous postcondition), whereas the latter would mean checking all side effects on collaborators, such as asserting that counter has been incremented by one. Including messages in the postconditions makes it possible to delegate specifications in the same way as behaviour.
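The implementation of this interception facility is not reproduced here; the following is only a minimal sketch of one way to record messages in Ruby. It wraps the public methods of the collaborator rather than instrumenting the caller, and all names (MessageRecorder, sent_messages) are hypothetical.

    # Minimal sketch: record the messages an object receives so that a
    # must_send-style assertion can inspect them afterwards.
    module MessageRecorder
      def self.watch(object)
        object.instance_variable_set(:@__messages, [])

        def object.sent_messages
          @__messages
        end

        object.class.public_instance_methods(false).each do |name|
          original = object.method(name)
          object.define_singleton_method(name) do |*args|
            @__messages << [name, args]   # record the message before forwarding it
            original.call(*args)
          end
        end
        object
      end
    end

    class Counter
      def initialize
        @count = 0
      end

      def add(i)
        @count += i
      end
    end

    counter = MessageRecorder.watch(Counter.new)
    counter.add(1)
    # A must_send(:add, counter, [1])-style check boils down to:
    puts counter.sent_messages.include?([:add, [1]])   # => true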

Chapter 7

Architecture

Elegance is not a dispensable luxury but a quality that decides between success and failure.

Computing Science: Achievements and Challenges
Edsger W. Dijkstra

Throughout the following pages we give grounds for two apparently rather mundane choices in relation to the current problem, which in the end turn out to be more interesting than expected.

First, in section 7.1, we discuss the selection of an implementation language to build a prototype that demonstrates the ideas presented in previous chapters. The different features we need are outlined, and the adoption of Ruby is then justified. A particularly important aspect behind this decision is the set of convenient facilities provided by this language for domain-specific language construction.

Second, in section 7.2, we describe a possible arrangement of the different components of the system to be built. Our goal is to make the system as modular as possible, so that in the future it can evolve from a prototype to a real framework, if desired, with relative ease.

7.1 Implementation Language

The way properties are encoded is, in our opinion, a major decision that bears a large share of the responsibility for the practical success or failure of any unit testing framework. As explained in chapter 4, we strongly believe that a tool built for unit testing general-purpose software should let programmers write specifications using the host language. The detailed justification was already given there, but it is worth reminding the reader how cumbersome it is to maintain a codebase written in two different languages (one for the specification and another for the implementation) in a typical environment where requirements are volatile and safety is not a critical issue. The disadvantages are likely to outnumber the benefits of that approach.

An apparently opposing argument to the usage of a native language for writing properties is the desirable documenting value of specifications. Proponents of test-driven methodologies have been advocating the usefulness of tests for this purpose for nearly a decade now [13]. Literate programming has been around much longer, supporting somewhat equivalent ideas [43]. Properties are higher-level and more concise constructs than common unit tests, and thus have a greater documenting potential. However, this potential could be diminished by the usage of the host language instead of a more concise, tailored, domain-specific one.

An ideal solution is to create a domain-specific language (DSL) within the host language, so that we keep both advantages. Few programming languages are well suited for this task. In particular, homoiconic ones stand apart. These treat code as data, so building new syntactic abstractions is the straightforward way of implementing everything [5]. Examples of homoiconic languages include the Lisp family, Prolog and Forth. However, employing a homoiconic language would mean we have not demonstrated that building a DSL for properties with a common language is certainly possible. Therefore, we have left them out of the equation.

Few of the remaining languages, which do not treat code as data, allow easy DSL construction. They need to provide flexible syntax and, in particular, first-class functions. It is difficult to explain briefly why, but the examples given in chapter 8 should be clarifying. We can recall a small number of languages that satisfy the above requirements, e.g. Ruby [69] or Scala [55].

Additional requirements are support for the object-oriented paradigm, for obvious reasons, as we want to demonstrate our ideas regarding the elaboration of specifications for that kind of system. Moreover, we need highly dynamic capabilities, so that we can easily develop a prototype that instruments properties, intercepts messages between objects or changes the semantics of methods to add contracts. For these reasons we have preferred Ruby over the rest, as it has an excellent track record of allowing remarkably dynamic applications, such as Ruby on Rails [68], to be built, and a very human-readable syntax that will lead to a nice language for writing properties.

7.2 System Architecture

As said in the introduction to this chapter, the goal of the system architecture is to enable an easy evolution of the prototype. For that purpose, it is of the utmost importance to define the system in terms of components with interfaces between them that are as simple as possible.

We believe elegant systems should have a clear structure and be based on a small set of metaphors. Our metaphor is that the properties to be checked are provided by the user and flow from one component to another. Each component does something to the properties. For instance, a runner component checks the properties and adds some information about the execution to them. Another component could take the output from the runner and elaborate a report with the details of the execution, or draw some graphs relating to property coverage. The best way to model such a structure is to use the pipes and filters pattern:

The pipes and filters architectural pattern provides a structure for systems that process a stream of data. Each processing step is encapsulated in a filter component. Data is passed through pipes between adjacent filters. Recombining filters allows you to build families of related filters. [22]

This philosophy of making each program (filter) do one thing well was first popularized by the Unix programming environment [40], and is the basis of its longevity and success. Of course, we do not need a piping system as complex as that of Unix. For instance, we do not have mechanisms for output redirection like the operators > and 2>, nor, more importantly, do we support a continuous stream of data entering the system, simply because it is not useful in our case.

However, we believe it is still very useful, as pipes and filters stresses small interfaces between components. Moreover, it defines a very convenient DSL for setting up the framework in no time. An important consequence of using the pipe syntax taken from Unix is that what would be a complicated function composition becomes more linear. Therefore, it is possible to distinguish the parameters used for setting up each filter from the elements (a list of properties) flowing from one filter to another. In the following examples, function composition is contrasted with piping. A list of properties is fed into a runner, which applies some test generation strategies, provided as parameters, in order to try to disprove them. Finally, a reporter transcribes the information gathered during the execution of the runner into a file.

    Reporter.new('a.txt',
      ComplexRunner.new(TextUI.new, RandomStrategy.new,
        PList[Property[:a], Property[:b]]))

    PList[Property[:a], Property[:b]] |
      ComplexRunner.new(TextUI.new, RandomStrategy.new) |
      Reporter.new('a.txt')

All elements need to have the | operator defined, in order to link the following filter to themselves. A default implementation is provided in the PipelineElement module. Furthermore, all elements need to respond to the output message. The idea is that when you ask for the output of the whole pipeline, you are calling the output of the last element. This forwards the call to the previous one, and so on until the control flow reaches the first filter. This one returns its result to the second one, which in turn performs some processing on the input and returns it to the third one. This continues until the output of the whole pipeline is returned to the user.

Filters expect to get a list of properties as input, that is, an instance of the standard Ruby Array class. We provide PList, a wrapper for this class, which implements the | and output methods, so that it can be placed directly in the pipeline, as shown in the previous example. Additionally, we provide two block wrappers, Action and SAction, to be used in the middle and at the beginning of the pipeline, respectively. The intention is that the user can place arbitrary code inside the pipeline without needing to write an ad hoc filter. These block wrappers can be built using the do! function, which decides which one to instantiate. That is, do! is a factory method [31]. A minimal sketch of the pipeline wiring appears below.
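The following sketch illustrates the pipes and filters wiring just described, under the assumption that each element keeps a reference to its predecessor and that output pulls data through the chain. Only the | and output messages come from the text; the source accessor, the process hook and the Upcase example filter are hypothetical simplifications, not the prototype's code.

    # Minimal sketch of the pipeline protocol: | links elements, output pulls
    # data from the head of the pipeline through every filter.
    module PipelineElement
      attr_accessor :source   # the previous element in the pipeline

      # a | b returns b with a registered as its source, so chains read left to right
      def |(other)
        other.source = self
        other
      end

      # Pull data from the predecessor, process it, and hand the result onwards.
      def output
        process(source.output)
      end
    end

    # PList wraps a plain Array so it can sit at the head of a pipeline.
    class PList
      include PipelineElement

      def self.[](*items)
        new(items)
      end

      def initialize(items)
        @items = items
      end

      def output
        @items            # the head of the pipeline just yields its contents
      end
    end

    # A trivial example filter.
    class Upcase
      include PipelineElement

      def process(items)
        items.map(&:upcase)
      end
    end

    result = (PList['a', 'b'] | Upcase.new).output
    puts result.inspect   # => ["A", "B"]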

Chapter 8

Design and Implementation

It is possible to make buildings by stringing together patterns, in a rather loose way. A building made like this is an assembly of patterns. It is not dense. It is not profound. But it is also possible to put patterns together in such a way that many patterns overlap in the same physical space: the building is very dense; it has many meanings captured in a small space; and through its density, it becomes profound.

A Pattern Language
Christopher Alexander

In this chapter we describe the relevant internal details of the different components of the framework we just outlined in the architecture. Our presentation follows a style first used in [15], and later employed in a paper about the popular JUnit tool [10]. The idea is to explain the design by starting from scratch and applying patterns one after another. We think this approach is particularly well suited for showing the main design decisions behind any framework.

8.1 Domain-Specific Language

First, we show how to evolve a domain-specific language (DSL) with the desired features.

Language extension

Preconditions. The system is going to have a language for specifying properties. The language should support arbitrarily complex expressions as well as multiple options for describing properties and contracts. Furthermore, it is expected to evolve in the future.

Problem. Creating a fully-featured language is a non-trivial task that involves going through all the different compiler construction steps: defining an abstract syntax tree (AST), building a lexer, a parser, etc.

Constraints. Introducing a new language for expressing properties and contracts adds unnecessary complexity to the framework internals and to its usage, as it forces users to deal simultaneously with two different languages. Therefore, the language used for specifying properties and contracts should have the same syntax and semantics as the host language, Ruby in our case.

Solution. Extend the existing language within its syntactic and semantic framework to form a new domain-specific language (DSL). The extensions may include new data types, semantic elements or syntactic sugar [64]. Code written in the new DSL can be embedded in source code of the existing language with no additional preprocessors.

[Figure 8.1: Language extension. The DSL is embedded within Ruby.]

Instantiation

Preconditions. The DSL has to offer basic facilities for creating properties and contracts.

Problem. How should the elementary parts of the DSL be modelled?

Constraints. A naïve design may lead to a DSL that is difficult to extend or to use, particularly from other code.

Solution. As a first step, the DSL is simply a set of methods on objects, instances of Property or Contract. Since objects are first-class citizens of the language, these will always be easily available to be called by humans and, especially, by other parts of the framework or other libraries, even if we introduce some syntactic sugar later on. Blocks (closures) provide a good mechanism for encapsulating the predicates written by the user.

Example 8.1 reverse property written using basic instantiation syntax.

    p = Property.new(:reverse, [String, String]) do |a, b|
      (a + b).reverse == a.reverse + b.reverse
    end
    p.always_check(['', ''], ['x', ''], ['', '1'])
    p.description('Example property')

Example 8.2 Math.sqrt contract written using basic instantiation syntax.

    precondition  = lambda { |n| n >= 0 }
    postcondition = lambda { |n, r| (r ** 2 - n).abs < 1e-5 }
    method = Math.method(:sqrt)

    c = Contract.new(method, [Float], precondition, postcondition)
    c.description('Example contract')

Sandbox

Preconditions. The elements manipulated by the DSL are objects.

Problem. Having to build a Property by hand, storing the created object and calling its methods one by one can be too cumbersome for systematic use. Moreover, it obscures the most important part of the property (its signature and the predicate) among a lot of boilerplate code. The same applies to Contract.

Constraints. If some syntactic sugar is introduced, it should be compatible with the present features, as well as with unknown future additions to the DSL.

Solution. The Property and Contract constructors accept a block of arity 0. It is interpreted as a sequence of commands to set up the object, and hence it is evaluated in the scope of the instance. This makes it unnecessary to store a reference to the object and call its methods one after another. Compatibility with upcoming features is ensured, as it works for all instance methods and for arbitrary code. Furthermore, it is still valid to build properties by passing the predicate directly as a block, since only closures of arity 0 are processed this way.

Example 8.3 reverse property refactored using sandboxing.

    Property.new(:reverse, [String, String]) do
      predicate do |a, b|
        (a + b).reverse == a.reverse + b.reverse
      end
      always_check(['', ''], ['x', ''], ['', '1'])
      description('Example property')
    end

Example 8.4 Math.sqrt contract refactored using sandboxing.

    Contract.new(Math.method(:sqrt), [Float]) do
      requires { |n| n >= 0 }
      ensures  { |n, r| (r ** 2 - n).abs < 1e-5 }
      description('Example contract')
    end
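The constructor logic described in the Solution can be pictured roughly as follows. This is a minimal sketch rather than the prototype's actual Property class, and the attribute handling is simplified.

    # Minimal sketch of the sandbox constructor: a block of arity 0 is a setup
    # script evaluated in the scope of the new instance; any other block is the
    # predicate itself.
    class Property
      def initialize(key, types, &block)
        @key   = key
        @types = types
        if block && block.arity.zero?
          instance_eval(&block)       # sandbox mode: run DSL commands on self
        else
          @predicate = block          # direct mode: the block is the predicate
        end
      end

      # DSL commands available inside the sandbox block
      def predicate(&block)
        @predicate = block
      end

      def description(text)
        @description = text
      end

      def check(*args)
        @predicate.call(*args)
      end
    end

    p = Property.new(:reverse, [String, String]) do
      description 'Example property'
      predicate { |a, b| (a + b).reverse == b.reverse + a.reverse }
    end
    puts p.check('ab', 'cd')   # => true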

Top-level methods

Preconditions. Thanks to sandboxing, properties and contracts can be defined in one single call.

Problem. The DSL still looks like common code. It would be desirable to make it more streamlined.

Constraints. All changes introduced should preserve the possibility of calling the DSL mechanically from other libraries and of introducing new features.

Solution. Creating some top-level methods (or functions) helps to hide the explicit calls to the constructors of Property and Contract. Furthermore, by making use of Hash, it is possible to remove parameter lists from the headers. Additionally, parentheses can be removed from method calls, provided that some rules are followed in the method definition.

Example 8.5 reverse property refactored using top-level methods.

    desc 'Example property'
    property :reverse => [String, String] do
      predicate do |a, b|
        (a + b).reverse == a.reverse + b.reverse
      end
      always_check ['', ''], ['x', ''], ['', '1']
    end

Example 8.6 Math.sqrt contract refactored using top-level methods (1).

    module Math
      desc 'Example contract'
      contract :sqrt => Float do
        requires { |n| n >= 0 }
        ensures  { |n, r| (r ** 2 - n).abs < 1e-5 }
      end
    end

(1) The fact that Math is a module and not a class should not be a cause of great concern here. In Ruby, Module is the superclass of Class. Unlike classes, modules cannot have instances. Hence, it makes a lot of sense to use one of them for placing utility methods. Inside modules, contracts behave as they would for free functions.
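The Hash trick of the Solution can be sketched as follows. The Property stand-in and the $last_description global used to pass the desc text along are assumed simplifications; the prototype may implement desc differently.

    # Minimal stand-in for the framework's Property class, just enough to run.
    class Property
      def initialize(key, types, &predicate)
        @key, @types, @predicate = key, types, predicate
      end

      def description(text)
        @description = text
      end
    end

    # Minimal sketch of the top-level methods: `property :reverse => [...]`
    # passes a one-entry Hash, from which key and input types are unpacked.
    def desc(text)
      $last_description = text
    end

    def property(signature, &block)
      key, types = signature.first        # e.g. [:reverse, [String, String]]
      p = Property.new(key, Array(types), &block)
      p.description($last_description) if $last_description
      $last_description = nil
      p
    end

    desc 'Example property'
    prop = property :reverse => [String, String] do |a, b|
      (a + b).reverse == b.reverse + a.reverse
    end
    puts prop.inspect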

8.2 Properties and Contracts

Basic structure

We have just seen how to develop a human-readable specification language by stepwise application of DSL patterns. It is worth noting that the more primitive forms shown at the beginning are still valid. This is useful for calling them programmatically from other code, as their syntax is more regular. Now let us define the structure of the objects created when using that language.

We have in fact already sketched a large portion of it, as all the syntax that appears inside blocks is nothing more than instance methods of Property and Contract. We already outlined this when explaining the sandbox pattern. It has the enormous advantage of making any new methods automatically available in the DSL.

Property has two attributes, types and predicate, with the obvious meaning: predicate stores the closure containing the Boolean expression which is the core of the property, and types is the list of input types to the predicate provided by the user. Both are provided during the construction of objects.

Contract is a special kind of Property, for the reasons already explained, and therefore is its subclass. It has three additional attributes: method, precondition and postcondition. method is a Method object indicating the method specified by the contract. precondition and postcondition have the same meaning as in Design by Contract, and serve for deriving the predicate as explained in chapter 6.

Furthermore, during the construction of a Contract, the semantics of the referred method needs to be changed. The fail-fast behaviour of Design by Contract is added so that when preconditions or postconditions are not satisfied, the normal control flow is interrupted. This can be done by adding both a before and an after aspect capturing the logic indicated in chapter 6 [41]. In a dynamic language, aspects can easily be implemented by replacing the original method with a new one which calls the old one. Nonetheless, we have used a library that does this task for us mechanically.

Additionally, the predicate of each Property instance is instrumented during construction, so that after its evaluation we can know the outcome of the different subexpressions. Also, all the outcomes that each subexpression of each property should evaluate to in order to be covered are stored in a CoverTable object. These are computed as indicated in algorithm 1.
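The aspect-style method replacement mentioned above can be sketched with plain Ruby meta-programming. The library used by the prototype is not named here, so the following is only an assumed, hand-rolled equivalent; ContractInjector, ContractError and the example contract are hypothetical.

    # Minimal sketch of injecting before/after contract checks by replacing a
    # method with a wrapper that calls the original.
    class ContractError < StandardError; end

    module ContractInjector
      def self.wrap(klass, name, pre:, post:)
        original = klass.instance_method(name)
        klass.define_method(name) do |*args|
          raise ContractError, 'precondition violated' unless instance_exec(*args, &pre)
          result = original.bind(self).call(*args)
          raise ContractError, 'postcondition violated' unless instance_exec(*args, result, &post)
          result
        end
      end
    end

    class Counter
      attr_reader :count

      def initialize
        @count = 0
      end

      def add(i)
        @count += i
      end
    end

    ContractInjector.wrap(Counter, :add,
      pre:  ->(i)    { i > 0 },
      post: ->(i, r) { r == count })

    c = Counter.new
    puts c.add(3)        # => 3
    begin
      c.add(-1)          # precondition violated, control flow interrupted
    rescue ContractError => e
      puts e.message
    end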

Property namespace

Preconditions. Properties and contracts are objects with the structure indicated above.

Problem. We need a mechanism for uniquely identifying them, so that properties can be called inside other properties for composition, and so that information about them can easily be stored and retrieved in a database, which we will presumably use for implementing the historic faults strategy.

Constraints. Object identity is difficult to use: each time files are loaded, different Property objects are instantiated for the same declaration statement. Ruby does not have good serialization facilities that would make storing objects in an image a practical way to circumvent this problem. Furthermore, the user needs a simple mechanism for feeding properties to the system; keeping track of objects manually is too cumbersome.

Solution. We included keys in the DSL precisely for this reason. Each key provided maps to the corresponding Property in a global namespace, and vice versa. Keys of contracts are calculated by concatenating the class and the method they refer to. A typical solution would be to use a singleton [31] for implementing the namespace. However, Ruby's classes are instances of the class Class, so using class methods is enough, as they do not have any remarkable restrictions: they are first-class citizens of the language. The namespace is accessed through the class method []. Moreover, properties can be evaluated directly by calling a class method named after the property and providing its arguments. This is implemented using the method_missing hook.

[Figure 8.2: Property, Contract and their main members. Property holds predicate: Proc and types: Array, exposes [](key: Object), and references a CoverTable; Contract adds precondition: Proc, postcondition: Proc and method: Method.]
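A minimal sketch of such a namespace follows, reusing the simplified Property stand-in from the earlier sketches. Registering the object from within initialize is an assumption about where registration happens, not a statement about the prototype.

    # Minimal sketch of the global property namespace: keys map to Property
    # instances through class methods, and Property.some_key(args) evaluates
    # the corresponding predicate via method_missing.
    class Property
      @registry = {}

      class << self
        def register(key, property)
          @registry[key] = property
        end

        def [](key)
          @registry[key]
        end

        # Property.reverse('ab', 'cd') evaluates the :reverse property directly.
        def method_missing(name, *args, &block)
          property = @registry[name]
          property ? property.check(*args) : super
        end

        def respond_to_missing?(name, include_private = false)
          @registry.key?(name) || super
        end
      end

      def initialize(key, types, &predicate)
        @key, @types, @predicate = key, types, predicate
        self.class.register(key, self)   # assumption: registration on construction
      end

      def check(*args)
        @predicate.call(*args)
      end
    end

    Property.new(:reverse, [String, String]) { |a, b| (a + b).reverse == b.reverse + a.reverse }
    puts Property[:reverse].check('ab', 'cd')   # => true
    puts Property.reverse('ab', 'cd')           # => true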

8.3 Runner

The Runner is the main component of the pipeline defined by the framework, explained in chapter 7. It takes as input a list of properties and tries to falsify them by finding appropriate test cases. The interface necessary to be placed in the pipeline is satisfied by providing the | and output methods.

UI objects subscribe to the Runner in order to get notifications about new test cases run, properties falsified, and so on, so that the user receives feedback either interactively or in batch mode. This has been done following the observer pattern [31]. The method add_observer is used for registering UI instances, whereas notify sends updates back. We have built two different UIs. On the one hand, a simple BatchUI that writes events to any IO object, including the standard output, on a line-by-line basis. On the other hand, an interactive TextUI that displays progress like a GUI but in text mode, much as the Unix top utility does. We have chosen to implement such an interface, and not a graphical one, since it can easily be reused within most Ruby development environments, which are predominantly text-mode, e.g. Emacs. Since it makes no sense at all to execute the Runner without any UI, it instantiates and subscribes one to itself when no observers are registered.

An interesting problem is choosing an adequate UI, since the interactive TextUI may malfunction when called inside the interactive Ruby interpreter (irb). This is due to the fact that irb is already an interactive text application, and for technical reasons running an interactive text application on top of another one may cause the terminal to stop working correctly. Through a factory method [31], which is the constructor of the abstract class UI, a decision is made according to the current execution environment. Furthermore, the abstract UI provides basic handling logic for the different events, but requires its concrete subclasses to implement some abstract methods, such as next or step, in order to define how those events should be translated into notifications to the end user.

Two different Runner implementations have been built: a SimpleRunner, which solely runs the human-generated test cases, and a ComplexRunner which, as its name indicates, has a more elaborate execution. Since both runners check properties one by one, part of their execution algorithm has been abstracted and placed into the SequentialRunner class by using a template method [31], which delegates some concrete implementation details to the subclasses. In particular, it asks them to indicate how each Property is checked through the check_property method.
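The split between SequentialRunner and its subclasses can be pictured with the following minimal sketch. The observer notifications and pipeline plumbing are omitted, and the PropertyStub with its human_cases is only a stand-in for the framework's Property interface.

    # Minimal sketch of the SequentialRunner template method: check iterates
    # over the properties, while how a single property is checked is deferred
    # to subclasses through check_property.
    class SequentialRunner
      def initialize(properties)
        @properties = properties
      end

      def check
        @properties.map do |property|
          [property, check_property(property)]   # hook implemented by subclasses
        end
      end
    end

    # SimpleRunner only runs the human-provided test cases attached to a property.
    class SimpleRunner < SequentialRunner
      def check_property(property)
        property.human_cases.all? { |args| property.check(*args) }
      end
    end

    PropertyStub = Struct.new(:key, :human_cases, :predicate) do
      def check(*args)
        predicate.call(*args)
      end
    end

    p = PropertyStub.new(:reverse, [['ab', 'cd']],
                         ->(a, b) { (a + b).reverse == b.reverse + a.reverse })
    puts SimpleRunner.new([p]).check.inspect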

The ComplexRunner has several different test case generation strategies (those described in chapter 5) and deals with them through a common interface, as we will see in the following section. For each Property, it asks each strategy to generate test cases if possible. It moves on to the next property when it finds a case that falsifies the property, or when the terminating condition is fulfilled. By default this is true when a certain amount of time has passed, or earlier if property coverage has already been achieved.

Properties are decorated [31] with information about the execution, so that other components placed after the Runner in the pipeline have the opportunity to take advantage of it. For instance, a reporter may be built to create logs, pretty-printing each property and indicating whether it was falsified or not, a typical feature found in popular testing frameworks.

An ErrorDatabase object provides access to a lightweight database [3], hiding the details of the relational schema. It is used by the Runner for storing failed test cases, as well as for retrieving that information for the historic faults test case generation strategy.

8.4 Test Case Generation Strategies

Template method

Preconditions. The ComplexRunner will have all the different test generation strategies as aggregates.

Problem. Strategies have very different implementations, but at the same time they share some similarities. For instance, they are only capable of generating cases once a property has been defined.

Constraints. The ComplexRunner would like to treat all strategies in the same way, through a common interface, without paying attention to their peculiarities.

Solution. Define a common interface for all strategies in an abstract class Strategy. This common interface hides their differences from callers. Each method of the interface is partially implemented and leaves the concrete details to the subclasses: it is a template method [31]. The idea is that strategies do not have to bother dealing with exceptional cases. All this is done for them once in the superclass.

[Figure 8.3: Runner hierarchy and collaborators. The abstract UI class (new, notify, plus the hooks properties, next, step, success and failure), specialized by BatchUI and by TextUI (which uses a ProgressBar and a ScrollBar), observes the Runner. Runner exposes |(filter), output, add_observer and check; it is specialized by SequentialRunner, in turn specialized by SimpleRunner and ComplexRunner via check_property. The ComplexRunner collaborates with an ErrorDatabase (new, insert_success, insert_error, update_property, get_cases) and with one or more Strategy objects.]

[Figure 8.4: Strategy hierarchy. The abstract Strategy class (public set_property, generate, exhausted? and progress; private gen, can and exh) is specialized by HistoricStrategy, RandomStrategy, ExhaustiveStrategy and HumanStrategy; HistoricStrategy collaborates with the ErrorDatabase.]

The common interface defines methods for setting the current property through set_property, generating a test case by calling generate, indicating whether the strategy can generate more cases (exhausted?) and providing a measure of progress (progress). In turn, concrete strategies need to define three private methods to complete the template methods defined by the public functions of Strategy. gen simply returns a test case, but it does not have to deal with any exceptional cases, which are managed in the template method of its superclass. can indicates whether the current Property is apt for generating any cases at all, i.e. whether it has the proper infrastructure defined. Finally, exh indicates whether the Strategy can generate more cases; again, it does so without bothering about exceptional situations, which are dealt with in its parent. A sketch of this split between the public template methods and the private hooks is given below, after the note on generators.

Generators and iterators

As already explained, generators for random test case construction need to respond to the message arbitrary by yielding a random value. The corresponding combinators, frequency and one_of, just instantiate a FrequencyCombinator object which takes a list of generators. Values not responding to arbitrary are wrapped so that combinators can deal with all elements in a composite fashion [31].
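The following minimal sketch shows the Strategy template methods and one concrete strategy. It follows the gen, can and exh naming from the text, while the default progress measure, the LIMIT constant and the stubs used to run the example (IntGen, PropertyStub) are simplifying assumptions.

    # Minimal sketch of the Strategy template method: the public methods handle
    # the exceptional cases once, while subclasses only implement gen, can, exh.
    class Strategy
      def set_property(property)
        @property = property
        @generated = 0
      end

      def generate
        return nil unless @property && can   # exceptional cases handled here, once
        @generated += 1
        gen
      end

      def exhausted?
        return true unless @property && can
        exh
      end

      def progress
        exhausted? ? 1.0 : 0.0               # crude default measure of progress
      end

      private

      def gen;  raise NotImplementedError; end
      def can;  raise NotImplementedError; end
      def exh;  raise NotImplementedError; end
    end

    # A concrete strategy: ask each input type for an arbitrary value.
    class RandomStrategy < Strategy
      LIMIT = 100   # arbitrary cap on generated cases

      private

      def gen;  @property.types.map(&:arbitrary); end
      def can;  @property.types.all? { |t| t.respond_to?(:arbitrary) }; end
      def exh;  @generated >= LIMIT; end
    end

    # Tiny stubs so the sketch runs on its own.
    class IntGen
      def self.arbitrary
        rand(-100..100)
      end
    end

    PropertyStub = Struct.new(:types)

    s = RandomStrategy.new
    s.set_property(PropertyStub.new([IntGen, IntGen]))
    puts s.generate.inspect   # e.g. [12, -7]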


More information

Mathematics and Computing: Level 2 M253 Team working in distributed environments

Mathematics and Computing: Level 2 M253 Team working in distributed environments Mathematics and Computing: Level 2 M253 Team working in distributed environments SR M253 Resource Sheet Specifying requirements 1 Overview Having spent some time identifying the context and scope of our

More information

VETRI VINAYAHA COLLEGE OF ENGINEERING AND TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

VETRI VINAYAHA COLLEGE OF ENGINEERING AND TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING VETRI VINAYAHA COLLEGE OF ENGINEERING AND TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING CS6403 SOFTWARE ENGINEERING II year/ IV sem CSE (Regulation 2013) UNIT 1- SOFTWARE PROCESS AND PROJECT

More information

RAISE in Perspective

RAISE in Perspective RAISE in Perspective Klaus Havelund NASA s Jet Propulsion Laboratory, Pasadena, USA Klaus.Havelund@jpl.nasa.gov 1 The Contribution of RAISE The RAISE [6] Specification Language, RSL, originated as a development

More information

Verification and Validation. Assuring that a software system meets a user s needs. Verification vs Validation. The V & V Process

Verification and Validation. Assuring that a software system meets a user s needs. Verification vs Validation. The V & V Process Verification and Validation Assuring that a software system meets a user s needs Ian Sommerville 1995/2000 (Modified by Spiros Mancoridis 1999) Software Engineering, 6th edition. Chapters 19,20 Slide 1

More information

STABILITY AND PARADOX IN ALGORITHMIC LOGIC

STABILITY AND PARADOX IN ALGORITHMIC LOGIC STABILITY AND PARADOX IN ALGORITHMIC LOGIC WAYNE AITKEN, JEFFREY A. BARRETT Abstract. Algorithmic logic is the logic of basic statements concerning algorithms and the algorithmic rules of deduction between

More information

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, September 18, ISSN SOFTWARE TESTING

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, September 18,   ISSN SOFTWARE TESTING International Journal of Computer Engineering and Applications, Volume XII, Special Issue, September 18, www.ijcea.com ISSN 2321-3469 SOFTWARE TESTING Rajat Galav 1, Shivank Lavania 2, Brijesh Kumar Singh

More information

CSE 374 Programming Concepts & Tools. Hal Perkins Fall 2015 Lecture 15 Testing

CSE 374 Programming Concepts & Tools. Hal Perkins Fall 2015 Lecture 15 Testing CSE 374 Programming Concepts & Tools Hal Perkins Fall 2015 Lecture 15 Testing Where we are Some very basic software engineering topics in the midst of tools Today: testing (how, why, some terms) Later:

More information

ITERATIVE MULTI-LEVEL MODELLING - A METHODOLOGY FOR COMPUTER SYSTEM DESIGN. F. W. Zurcher B. Randell

ITERATIVE MULTI-LEVEL MODELLING - A METHODOLOGY FOR COMPUTER SYSTEM DESIGN. F. W. Zurcher B. Randell ITERATIVE MULTI-LEVEL MODELLING - A METHODOLOGY FOR COMPUTER SYSTEM DESIGN F. W. Zurcher B. Randell Thomas J. Watson Research Center Yorktown Heights, New York Abstract: The paper presents a method of

More information

This is already grossly inconvenient in present formalisms. Why do we want to make this convenient? GENERAL GOALS

This is already grossly inconvenient in present formalisms. Why do we want to make this convenient? GENERAL GOALS 1 THE FORMALIZATION OF MATHEMATICS by Harvey M. Friedman Ohio State University Department of Mathematics friedman@math.ohio-state.edu www.math.ohio-state.edu/~friedman/ May 21, 1997 Can mathematics be

More information

Lecture 1: Overview

Lecture 1: Overview 15-150 Lecture 1: Overview Lecture by Stefan Muller May 21, 2018 Welcome to 15-150! Today s lecture was an overview that showed the highlights of everything you re learning this semester, which also meant

More information

A Simulator for high level Petri Nets: Model based design and implementation

A Simulator for high level Petri Nets: Model based design and implementation A Simulator for high level Petri Nets: Model based design and implementation Mindaugas Laganeckas Kongens Lyngby 2012 IMM-M.Sc.-2012-101 Technical University of Denmark Informatics and Mathematical Modelling

More information

Category Theory in Ontology Research: Concrete Gain from an Abstract Approach

Category Theory in Ontology Research: Concrete Gain from an Abstract Approach Category Theory in Ontology Research: Concrete Gain from an Abstract Approach Markus Krötzsch Pascal Hitzler Marc Ehrig York Sure Institute AIFB, University of Karlsruhe, Germany; {mak,hitzler,ehrig,sure}@aifb.uni-karlsruhe.de

More information

Model-Based Design for Large High Integrity Systems: A Discussion Regarding Model Architecture

Model-Based Design for Large High Integrity Systems: A Discussion Regarding Model Architecture Model-Based Design for Large High Integrity Systems: A Discussion Regarding Model Architecture By Mike Anthony and Jon Friedman MathWorks Inc, Natick, MA, 01760 INTRODUCTION From complex controls problems

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

In this Lecture you will Learn: Testing in Software Development Process. What is Software Testing. Static Testing vs.

In this Lecture you will Learn: Testing in Software Development Process. What is Software Testing. Static Testing vs. In this Lecture you will Learn: Testing in Software Development Process Examine the verification and validation activities in software development process stage by stage Introduce some basic concepts of

More information

Up and Running Software The Development Process

Up and Running Software The Development Process Up and Running Software The Development Process Success Determination, Adaptative Processes, and a Baseline Approach About This Document: Thank you for requesting more information about Up and Running

More information

Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence

Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence Ratko Orlandic Department of Computer Science and Applied Math Illinois Institute of Technology

More information

Ingegneria del Software Corso di Laurea in Informatica per il Management

Ingegneria del Software Corso di Laurea in Informatica per il Management Ingegneria del Software Corso di Laurea in Informatica per il Management Software testing Davide Rossi Dipartimento di Informatica Università di Bologna Validation and verification Software testing is

More information

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = (

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = ( Floating Point Numbers in Java by Michael L. Overton Virtually all modern computers follow the IEEE 2 floating point standard in their representation of floating point numbers. The Java programming language

More information

Lecture Notes on Arrays

Lecture Notes on Arrays Lecture Notes on Arrays 15-122: Principles of Imperative Computation July 2, 2013 1 Introduction So far we have seen how to process primitive data like integers in imperative programs. That is useful,

More information

Data Verification and Validation (V&V) for New Simulations

Data Verification and Validation (V&V) for New Simulations Data Verification and Validation (V&V) for New Simulations RPG Special Topic 9/15/06 1 Table of Contents Introduction 1 Data V&V Activities During M&S Development 1 Determine M&S Requirements Phase 2 V&V

More information

XI International PhD Workshop OWD 2009, October Fuzzy Sets as Metasets

XI International PhD Workshop OWD 2009, October Fuzzy Sets as Metasets XI International PhD Workshop OWD 2009, 17 20 October 2009 Fuzzy Sets as Metasets Bartłomiej Starosta, Polsko-Japońska WyŜsza Szkoła Technik Komputerowych (24.01.2008, prof. Witold Kosiński, Polsko-Japońska

More information

Multiple Pivot Sort Algorithm is Faster than Quick Sort Algorithms: An Empirical Study

Multiple Pivot Sort Algorithm is Faster than Quick Sort Algorithms: An Empirical Study International Journal of Electrical & Computer Sciences IJECS-IJENS Vol: 11 No: 03 14 Multiple Algorithm is Faster than Quick Sort Algorithms: An Empirical Study Salman Faiz Solehria 1, Sultanullah Jadoon

More information

6. Relational Algebra (Part II)

6. Relational Algebra (Part II) 6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed

More information

UX Research in the Product Lifecycle

UX Research in the Product Lifecycle UX Research in the Product Lifecycle I incorporate how users work into the product early, frequently and iteratively throughout the development lifecycle. This means selecting from a suite of methods and

More information

Software Testing. Minsoo Ryu. Hanyang University. Real-Time Computing and Communications Lab., Hanyang University

Software Testing. Minsoo Ryu. Hanyang University. Real-Time Computing and Communications Lab., Hanyang University Software Testing Minsoo Ryu Hanyang University Topics covered 1. Testing Goals and Principles 2. Testing Process 3. Testing Strategies Component testing Integration testing Validation/system testing 4.

More information

Human Error Taxonomy

Human Error Taxonomy Human Error Taxonomy The Human Error Taxonomy (HET) provides a structure for requirement errors made during the software development process. The HET can be employed during software inspection to help

More information

Utilizing Fast Testing to Transform Java Development into an Agile, Quick Release, Low Risk Process

Utilizing Fast Testing to Transform Java Development into an Agile, Quick Release, Low Risk Process Utilizing Fast Testing to Transform Java Development into an Agile, Quick Release, Low Risk Process Introduction System tests, often called slow tests, play a crucial role in nearly every Java development

More information

Verification and Validation. Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1

Verification and Validation. Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1 Verification and Validation Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1 Verification vs validation Verification: "Are we building the product right?. The software should

More information

Distributed Systems Programming (F21DS1) Formal Verification

Distributed Systems Programming (F21DS1) Formal Verification Distributed Systems Programming (F21DS1) Formal Verification Andrew Ireland Department of Computer Science School of Mathematical and Computer Sciences Heriot-Watt University Edinburgh Overview Focus on

More information

Testing. Prof. Clarkson Fall Today s music: Wrecking Ball by Miley Cyrus

Testing. Prof. Clarkson Fall Today s music: Wrecking Ball by Miley Cyrus Testing Prof. Clarkson Fall 2017 Today s music: Wrecking Ball by Miley Cyrus Review Previously in 3110: Modules Specification (functions, modules) Today: Validation Testing Black box Glass box Randomized

More information

Introduction to Software Engineering

Introduction to Software Engineering Introduction to Software Engineering Gérald Monard Ecole GDR CORREL - April 16, 2013 www.monard.info Bibliography Software Engineering, 9th ed. (I. Sommerville, 2010, Pearson) Conduite de projets informatiques,

More information

Administrivia. ECE/CS 5780/6780: Embedded System Design. Acknowledgements. What is verification?

Administrivia. ECE/CS 5780/6780: Embedded System Design. Acknowledgements. What is verification? Administrivia ECE/CS 5780/6780: Embedded System Design Scott R. Little Lab 8 status report. Set SCIBD = 52; (The Mclk rate is 16 MHz.) Lecture 18: Introduction to Hardware Verification Scott R. Little

More information

TOOLS AND TECHNIQUES FOR TEST-DRIVEN LEARNING IN CS1

TOOLS AND TECHNIQUES FOR TEST-DRIVEN LEARNING IN CS1 TOOLS AND TECHNIQUES FOR TEST-DRIVEN LEARNING IN CS1 ABSTRACT Test-Driven Development is a design strategy where a set of tests over a class is defined prior to the implementation of that class. The goal

More information

Lecture 3: Linear Classification

Lecture 3: Linear Classification Lecture 3: Linear Classification Roger Grosse 1 Introduction Last week, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features.

More information

Integration Testing. Conrad Hughes School of Informatics. Slides thanks to Stuart Anderson

Integration Testing. Conrad Hughes School of Informatics. Slides thanks to Stuart Anderson Integration Testing Conrad Hughes School of Informatics Slides thanks to Stuart Anderson 19 February 2010 Software Testing: Lecture 10 1 Unit Test vs Integration Testing 1 The ideal in unit testing is

More information

Chapter 1: Principles of Programming and Software Engineering

Chapter 1: Principles of Programming and Software Engineering Chapter 1: Principles of Programming and Software Engineering Data Abstraction & Problem Solving with C++ Fifth Edition by Frank M. Carrano Software Engineering and Object-Oriented Design Coding without

More information

Software Testing CS 408

Software Testing CS 408 Software Testing CS 408 1/09/18 Course Webpage: http://www.cs.purdue.edu/homes/suresh/408-spring2018 1 The Course Understand testing in the context of an Agile software development methodology - Detail

More information