Improving the Applicability of Object-Oriented Class Cohesion Metrics


Jehad Al Dallal
Department of Information Science, Kuwait University
P.O. Box 5969, Safat 13060, Kuwait

Abstract

Context: Class cohesion is an important object-oriented quality attribute. It refers to the degree of relatedness between the methods and attributes of a class. Several metrics have been proposed to measure the extent to which the class members are related. Most of these metrics have undefined values for a relatively high percentage of classes, which limits their applicability. The classes that have undefined values lack methods, attributes, or parameter types, or they include only a single method.

Objective: We improve the applicability of the class cohesion metrics by defining their values for such special classes. In addition, we theoretically and empirically validate the improved metrics.

Method: We theoretically examine whether the defined values satisfy the key cohesion properties. In addition, we empirically validate the metrics before and after the improvements to test whether the defined values improve the ability of the metrics to evaluate class cohesion. We also explore the correlation between the metrics and the presence of faulty classes to indirectly determine the strength or weakness of the metrics in indicating class quality.

Results: The results show that our assigned values for the undefined cases do not violate the key cohesion properties, considerably improve the ability of the metrics to explain the presence of faulty classes, and may therefore improve their ability to indicate the quality of the class design.

Conclusions: Having the class cohesion metrics defined for all possible cases improves the applicability of the metrics and potentially increases their precision in indicating class quality.

Keywords: metric applicability, object-oriented software quality, object-oriented class cohesion, fault prediction.

1. Introduction

Software engineering aims to develop high-quality software. Several internal quality attributes are considered during the software design phase, including cohesion, coupling, and complexity. All of these quality attributes are important and they all must be carefully considered during the software development phases. Cohesion refers to the extent to which the components in a software module are related. Developers and researchers pay attention to cohesion because they believe that highly cohesive modules are more understandable, modifiable, maintainable, and reusable than less cohesive modules (Briand et al. 2001a; Chae et al. 2000).

The advantages supported by the object-oriented paradigm depend on the class notion; therefore, the class design plays a key role in the overall quality of the object-oriented software (Chae et al. 2000). The basic unit in object-oriented software is the class, which consists of methods and attributes. A poorly designed class has unrelated methods and attributes. Class cohesion metrics measure the relatedness of the methods and attributes in a class.

Although there is no upper limit to the number of attributes and methods in a class, developers are encouraged to avoid large classes in order to increase the understandability, manageability, and maintainability of their code. Previous empirical studies (e.g., Makela and Leppanen 2007, Chae et al. 2000) reported that relatively large portions of object-oriented systems are composed of small classes that consist of a single attribute or a single method. Unfortunately, most class cohesion metrics are undefined for one or more of these cases, i.e., cases in which a class has no attributes or parameter types, or consists of a single attribute or method. This limits the applicability and usability of these metrics for a relatively large percentage of classes. For instance, the class cohesion metrics proposed by Chidamber and Kemerer (1991, 1994), Bieman and Kang (1995), Henderson-Sellers (1996), Bansiya et al. (1999), Badri (2004), Bonja and Kidanmariam (2006), Fernandez and Pena (2006), and Counsell et al. (2006) are undefined for classes consisting of fewer than two methods. In addition, the cohesion metrics proposed by Henderson-Sellers (1996), Briand et al. (1998), Fernandez and Pena (2006), and Counsell et al. (2006) are undefined for classes that do not contain any attributes. Finally, one of the metrics proposed by Counsell et al. (2006) is inapplicable for classes in which none of the methods has parameters. Limiting the applicability of these metrics to certain classes may well discourage software developers from applying such metrics to assess the quality of their classes.
In addition, such a limitation can result in unrepresentative overall quality indicator values for the system under consideration, which in turn may decrease confidence in the overall quality of the system because the developers may have no knowledge regarding the cohesion of a large portion of the system. Several empirical validation studies, including Briand et al. (1998), Briand et al. (2001), Gyimothy et al. (2005), Aggarwal et al. (2007), Marcus et al. (2008), Al Dallal and Briand (2010a), Al Dallal and Briand (2010b), and Al Dallal (2011a), excluded all classes that feature cases for which the considered metrics are undefined. Yet, if the metrics have defined values for these classes, including them in the empirical studies could greatly alter the validation results, either positively or negatively, because the percentage of classes that feature cases for which the considered metrics are undefined is typically high in object-oriented systems.

This limitation suggests a need to assign cohesion values to special cases in which some of the metrics are undefined. These assigned values must be theoretically and empirically validated to verify that they do not cause the metrics to violate key cohesion properties (e.g., the properties proposed by Briand et al. 1998) and to increase the confidence that these assigned values enhance the ability of the metrics to properly indicate cohesion.

In this paper, we propose values for twelve cohesion metrics, including the ones proposed by Chidamber and Kemerer (1991, 1994), Bieman and Kang (1995), Henderson-Sellers (1996), Briand et al. (1998), Bansiya et al. (1999), Badri (2004), Bonja and Kidanmariam (2006), Fernandez and Pena (2006), and Counsell et al. (2006), for the cases in which they are undefined, thereby making the metrics appropriate for all possible cases. We define criteria for assigning and justifying these values. In addition, we confirm that these values do not cause the metrics to violate the widely used and accepted key cohesion properties proposed by Briand et al. (1998). If these assigned values are consistent with common intuition, they are expected to improve the ability of the metrics to precisely indicate class quality. Therefore, we empirically explore the validity of the enhanced metrics using several software systems, both including and not including those classes that may have previously led to undefined cohesion values. We compare and discuss the empirical results for both cases. The empirical validation involves fifteen cohesion metrics, including the most common cohesion metrics in the literature, applied to classes from four open-source Java systems.

Interestingly, the results of the validation study show that the ability of the considered metrics to explain the presence of faulty classes and, therefore, the ability of the metrics to precisely indicate class quality, are improved for most of the metrics considered. Consequently, the results indicate that our assignment of values to cases whose cohesion was previously undefined increases the applicability of the metrics and also improves the precision of most metrics in indicating class quality. This conclusion is based on the widely accepted assumption, used in many studies (e.g., Briand et al. 1998, Briand et al. 2001, Gyimothy et al. 2005, Aggarwal et al. 2007, and Marcus et al. 2008), that any metric that predicts faulty classes more precisely should be a better class quality indicator.

The major contributions of this paper are as follows:

1. Proposing and applying criteria for assigning cohesion values for the classes that have previously led to undefined cohesion values.
The assigned values make cohesion metrics applicable for classes of any number of attributes and methods.
2. Empirically studying the applicability of fifteen class cohesion metrics, before being improved, to classes of four Java open-source systems.
3. Empirically studying the relationship between the cohesion values and the presence of faulty classes in four Java open-source systems using fifteen class cohesion metrics before and after improving their applicability.

This paper is organized as follows. Section 2 reviews related work. Section 3 defines criteria for assigning cohesion values and applies them to several metrics. Section 4 presents several empirical case studies and reports and discusses their results. Finally, Section 5 concludes the paper and discusses future work.

2. Related Work

Yourdon et al. (1979) proposed seven levels of cohesion: coincidental, logical, temporal, procedural, communicational, sequential, and functional. The cohesion levels are listed in ascending order of their desirability. Since then, several cohesion metrics have been proposed for procedural and object-oriented programming languages. Different models are used to measure the cohesion of procedural programs, such as control flow graphs (Emerson 1984), variable dependence graphs (Lakhotia 1993), and program data slices (Ott and Thuss 1993, Bieman and Ott 1994, Meyers and Binkley 2007, Al Dallal 2007a, Al Dallal 2009b). Cohesion has also been measured indirectly by examining the quality of the structured designs (Troy and Zweben 1981, Bieman and Kang 1998).

Several class cohesion metrics have been proposed in the literature. These metrics are based on data that is made available during the high- or low-level design phases. The attributes and their types and the methods and their parameters are examples of data that is available during the high-level design (HLD) phase. The attributes that are accessed by the methods in a class are examples of data that is available during the low-level design (LLD) phase. The HLD cohesion metrics identify potential cohesion issues early in the HLD phase. The LLD cohesion metrics use finer-grained data than the HLD cohesion metrics. This data precisely defines the relationships between attributes and methods.

Cohesion Among Methods in a Class (CAMC), Normalized Hamming Distance (NHD), Scaled NHD (SNHD), Distance Design-based Direct Class Cohesion (D3C2), and Similarity-based Class Cohesion (SCC) are examples of HLD metrics (Al Dallal and Briand 2010a). The CAMC metric (Bansiya et al. 1999) uses a parameter-occurrence matrix that includes one row for each method and one column for each data type used at least once as the type of a parameter in at least one method in the class. The value in row i and column j of this matrix equals 1 when the i-th method has a parameter of the j-th data type and it equals 0 otherwise. The CAMC metric is defined as the ratio of the total number of 1s in the matrix to the total size of the matrix. The NHD metric (Counsell et al. 2006) uses the same parameter-occurrence matrix as that used by CAMC. The metric calculates the average parameter agreement between each pair of methods. The parameter agreement between a pair of methods is defined as the number of entries for which the corresponding rows in the parameter-occurrence matrix exactly match. Scaled NHD (SNHD) (Counsell et al. 2006) is a metric that represents the closeness of the NHD metric to the maximum value of NHD, as opposed to the minimum value.

The D3C2 metric (Al Dallal 2007b) uses a matrix that features a row for each method and a column for each distinct parameter type that matches an attribute type. The value in row i and column j of the matrix equals 1 when the j-th data type is a type of at least one of the parameters or a return value of the i-th method and it equals 0 otherwise. The relative distance between a pair of methods is defined as the number of corresponding entries that have different values divided by the number of columns in the matrix. The cohesion is calculated by measuring the average relative distance between each pair of methods and subtracting the result from 1. The SCC metric (Al Dallal and Briand 2010a) uses a matrix called the Direct Attribute Type (DAT) matrix, which is defined in the same way as the matrix used by D3C2. SCC is the weighted average of four different metrics that take into account method-method, attribute-attribute, and attribute-method direct and transitive interactions (i.e., interactions caused by method invocations).

The LLD metrics are based on the use or sharing of class instance variables. The Lack of Cohesion of Methods (LCOM1) metric (Chidamber and Kemerer 1991) counts the number of pairs of methods that do not share instance variables. Chidamber and Kemerer (1994) proposed another version of the LCOM metric, referred to here as LCOM2, which calculates the difference between the number of method pairs that do and do not share instance variables. Li and Henry (1993) use an undirected graph that represents each method as a node and the sharing of at least one instance variable as an edge. Class cohesion, referred to here as LCOM3, is measured in terms of the number of connected components in the graph. This class cohesion approach was extended by Hitz and Montazeri (1995), who added an edge between a pair of methods when one invokes the other. We refer to the extended metric as LCOM4.

Bieman and Kang (1995) proposed two class cohesion metrics, namely Tight Class Cohesion (TCC) and Loose Class Cohesion (LCC), to measure the relative number of directly connected pairs of public methods and the relative number of directly or indirectly connected pairs of public methods, respectively. A method is directly connected to an attribute when the attribute appears within the method's body. A method m is indirectly connected to an attribute when the attribute appears in the body of a method directly or transitively invoked by method m. Two methods are directly connected if they are each directly connected to a common attribute, and they are indirectly connected if each of them is directly or transitively connected to a different attribute and if both attributes are connected to a third method. The Degree of Cohesion-Direct (DC_D) metric is similar to TCC and the Degree of Cohesion-Indirect (DC_I) metric is similar to LCC (Badri 2004), but they also consider two methods to be connected if they invoke the same method.

Henderson-Sellers (1996) proposed a metric called LCOM5 that measures a lack of cohesion across multiple methods by considering the number of methods that reference each attribute. The metric is defined as LCOM5 = (a - kl)/(l - kl), where l is the number of attributes, k is the number of methods, and a is the summation of the number of distinct attributes accessed by each method in a class. Briand et al. (1998) proposed a cohesion metric called Coh that computes cohesion as the ratio of the number of distinct attributes accessed in the methods of a class.

The Class Cohesion (CC) metric (Bonja and Kidanmariam 2006, Gui and Scott 2006) uses the degree of similarity between methods as a basis to measure class cohesion. The similarity between a pair of methods is defined as the ratio of the number of shared attributes to the number of distinct attributes referenced by both methods. Cohesion is defined as the ratio of the summation of the similarities between all pairs of methods to the total number of possible pairs of methods. The Sensitive Class Cohesion Metric (SCOM) (Fernandez and Pena 2006) is similar to the CC metric. The only difference lies in the definition of similarity. Given a class that has l attributes, the similarity between a pair of methods i and j, which reference the sets of attributes I_i and I_j, respectively, is formally defined as follows:

\[ \mathrm{Similarity}(i, j) = \frac{|I_i \cap I_j|}{\min(|I_i|, |I_j|)} \cdot \frac{|I_i \cup I_j|}{l} \]

The LLD similarity-based class cohesion (LSCC) metric (Al Dallal and Briand 2010b) defines the similarity between each pair of methods as the ratio of the number of shared attributes between the methods to the total number of attributes in the class. Similar to CC and SCOM, cohesion, as measured by LSCC, is defined as the ratio of the sum of the similarities between all pairs of methods to the total number of possible pairs of methods.

Wang et al. (2005) introduced the Dependence Matrix-based Cohesion (DMC) class cohesion metric, which is based on a dependency matrix that represents the degree of dependence among the instance variables and methods in a class. Chen et al. (2002) used dependence analysis to explore attribute-attribute, attribute-method, and method-method interactions. They measure cohesion as the relative number of interactions.

Al Dallal (2011a) proposes a class cohesion metric called path connectivity class cohesion (PCCC), based on counting the number of possible paths in a graph that represents the connectivity pattern of the class members. PCCC was empirically compared with most of the metrics considered in the current paper in terms of the ability of the metrics to detect faulty classes and to differentiate between classes that feature different connectivity patterns. That empirical study had the limitation addressed in Section 1 of the current paper, namely that it excluded the classes for which most of the cohesion metrics are undefined. The current paper eliminates this limitation and therefore allows for more representative results than before.

The definition of the cohesive interaction differs from one metric to another. In TCC, LCC, DC_D, DC_I, LSCC, CC, SCOM, LCOM1, LCOM2, LCOM3, and LCOM4, accessing the same attribute by two methods in a class of interest is considered a cohesive interaction between the two methods. An access to an attribute by a method in a class of interest is considered by Coh and LCOM5 as a cohesive interaction between the method and the attribute. CAMC considers the presence of a parameter type in the declaration of a method in a class of interest as a cohesive interaction between the method and the parameter type.
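For the metrics that treat a shared attribute as a cohesive interaction between two methods, the definitions reviewed in this section can be illustrated with a minimal sketch. It is not the implementation used in the paper. The function name `pairwise_metrics` and its inputs (a mapping from method name to the set of attributes it accesses, plus the attribute count) are hypothetical. The sketch covers only the general case of at least two methods and at least one attribute. It assumes that LCOM2 is floored at zero, as in Chidamber and Kemerer's 1994 definition, and that a pair sharing no attributes contributes zero similarity.

```python
from itertools import combinations


def pairwise_metrics(method_attrs, num_attributes):
    """Return LCOM1, LCOM2, CC, SCOM, and LSCC for a class with m > 1 and a > 0.

    method_attrs: dict mapping a method name to the set of attributes it accesses
                  (hypothetical input format).
    num_attributes: total number of attributes l declared in the class.
    """
    if len(method_attrs) < 2 or num_attributes < 1:
        raise ValueError("this sketch covers only classes with m > 1 and a > 0")

    pairs = list(combinations(method_attrs.values(), 2))
    sharing = sum(1 for first, second in pairs if first & second)
    not_sharing = len(pairs) - sharing

    cc = scom = lscc = 0.0
    for first, second in pairs:
        shared = len(first & second)
        if shared == 0:
            continue  # assumption: no shared attribute means zero similarity
        union = len(first | second)
        smaller = min(len(first), len(second))
        cc += shared / union                                   # CC similarity
        scom += (shared / smaller) * (union / num_attributes)  # SCOM similarity
        lscc += shared / num_attributes                        # LSCC similarity as described above

    pair_count = len(pairs)
    return {
        "LCOM1": not_sharing,
        "LCOM2": max(not_sharing - sharing, 0),  # assumption: floored at zero
        "CC": cc / pair_count,
        "SCOM": scom / pair_count,
        "LSCC": lscc / pair_count,
    }


example = {"m1": {"x", "y"}, "m2": {"y"}, "m3": {"z"}}
result = pairwise_metrics(example, num_attributes=3)
```

In this example only the pair (m1, m2) shares an attribute, so LCOM1 is 2, LCOM2 is 1, CC is 1/6, SCOM is 2/9, and LSCC is 1/9. Calling the function with fewer than two methods or with no attributes raises an error, which mirrors the undefined cases discussed in the text.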
Finally, the NHD metric considers the sharing of either the presence or absence of parameter types between two methods as a cohesive interaction between the two methods.

Most of the existing cohesion metrics have undefined values for classes in several special cases. We will focus on four such cases: (1) the case in which a class with no methods has attributes, (2) the case in which a class with a single method does not have attributes, (3) the case in which a class with a single method has attribute(s), and (4) the case in which a class with multiple methods does not have attributes. We focus on these four cases because they are the only cases for which one or more of the considered LLD metrics have undefined cohesion values. The considered HLD metrics have undefined cohesion values for the following four corresponding cases: (1) the case in which a class does not have methods, (2) the case in which a class with a single method does not have parameter types, (3) the case in which a class with a single method has parameter type(s), and (4) the case in which a class with multiple methods does not have parameter types.

Table 1 lists the metrics considered in this paper and their values for the aforementioned special cases in terms of the number of methods, attributes, and parameter types. In the table, m is the number of methods, and a is the number of attributes for LLD metrics or the number of distinct parameter types for HLD metrics. The table shows that only three metrics, namely LSCC, LCOM3, and LCOM4, are defined for all cases. The remaining metrics are undefined for most of the cases considered. All the considered metrics are defined for the remaining case, which is not included in the table (i.e., the case in which m>1 and a>0). In this paper, values are assigned for the special cases in which the metrics are undefined. Although some metrics are defined for all cases, it is important to improve other metrics by defining them for all cases because each metric has its own advantages and applications. For example, none of the considered metrics that are defined for all cases is applicable during the HLD phase. In addition, previous empirical studies (e.g., Al Dallal and Briand 2010b) show that some of the other metrics, such as LCOM2, LCOM4, and TCC, contribute to optimal fault-prediction models.

| Metric | m=0 and a>0 | m=1 and a=0 | m=1 and a>0 | m>1 and a=0 |
|---|---|---|---|---|
| LSCC | defined | defined | defined | defined |
| LCOM1, LCOM2 | undefined | undefined | undefined | m(m-1)/2 |
| TCC, LCC | undefined | undefined | undefined | 0 |
| DC_D, DC_I | undefined | undefined | undefined | α/NP, where α is the number of method pairs that share directly or transitively the invocation of a third method and NP is the number of method pairs |
| CC, SCOM, LCOM5 | undefined | undefined | undefined | undefined |
| LCOM3 | defined | defined | defined | m |
| LCOM4 | defined | defined | defined | Number of disjoint components considering method invocations |
| Coh | undefined | undefined | α/a, where α is the summation of the number of distinct attributes accessed by each method in a class | undefined |
| CAMC | undefined | undefined | 1 | undefined |
| NHD | undefined | undefined | undefined | undefined |

Table 1: Defined and undefined values for the considered metrics

Briand et al. (1998) define four mathematical properties that provide a supporting underlying theory for class cohesion metrics. The first property, called nonnegativity and normalization, states that the cohesion measure belongs to a specific interval [0, Max], where Max is a fixed number for the considered metric that is independent of the analyzed system. Normalization allows for easy comparisons between the cohesion metrics of different classes.

The second property, called null value and maximum value, holds that the cohesion of a class equals 0 if the class has no cohesive interactions (i.e., interactions among attributes and methods of a class) and the cohesion is equal to Max if all possible interactions within the class are present. For inverse cohesion measures (e.g., LCOM1 and LCOM2), the lack of cohesion of a class equals Max if the class has no cohesive interactions and the lack of cohesion is equal to 0 if all possible interactions within the class are present.

The third property, called monotonicity, holds that adding cohesive interactions to the class cannot decrease its cohesion. For inverse cohesion measures, the property holds that adding cohesive interactions to the class cannot increase its lack of cohesion.

The fourth property, called cohesive modules, holds that merging two unrelated classes into a single class does not increase the class's cohesion, which means that the merged class has a cohesion value that is less than or equal to the maximum value among the cohesion values of each of the two unrelated classes. Therefore, given two classes, c1 and c2, the cohesion of the merged class c' must satisfy the following condition: cohesion(c') ≤ max{cohesion(c1), cohesion(c2)}. For inverse cohesion measures, the lack of cohesion of the merged class must satisfy the following condition: lack_of_cohesion(c') ≥ min{lack_of_cohesion(c1), lack_of_cohesion(c2)}.

If a metric does not satisfy one or more of these properties, it is considered ill-defined (Briand et al. 1998). The mathematical properties proposed by Briand et al. have been widely used to support the theoretical validation of several proposed class cohesion metrics (e.g., Briand et al. 1998, Zhou et al. 2002, Zhou et al. 2004, Al Dallal 2009a, Al Dallal 2010a).

3. Assigning Cohesion Values

When assigning a value for a metric applied to a class that exhibits a special case, two necessary criteria must be considered, as follows.

Criterion 1: Satisfying key cohesion properties. The first criterion is that the assigned value must not cause the metric to violate any of the widely used and accepted cohesion properties proposed by Briand et al. (1998) and explained in Section 2. Thus, the assigned values must not cause the metric to be theoretically invalid.
Otherwise, the metric becomes ill-defined and its usefulness for indicating cohesion becomes questionable. The only exception to this criterion is when the original definition of the metric violates a property. In this case, the validation of the metric regarding the assigned value can depend on the validation result of the other cases that already have defined values. For example, validating a metric against the cohesive modules property requires merging the class under consideration with another class c and comparing the cohesion value of the merged class with each of the cohesion values of the original classes. Therefore, the result of the validation does not depend only on the assigned value but also on the values given by the metric definition for both class c and the merged class, which might have already violated the cohesive modules property. Specific examples are provided later for the case in which the original definition of the metric violates the cohesive modules property when the values are assigned to LCOM5 (Section 3.2) and NHD (Section 3.5). The key cohesion properties must be satisfied by cohesion metrics, but they are not sufficient to conclude whether the metric is indeed a good cohesion metric.

Criterion 2: Consistency with common intuition. Common intuition is one way to judge whether a metric indicates cohesion correctly. Therefore, the assigned value must not contradict the common intuition built on (1) the general cohesion definition (i.e., the relatedness among the attributes and methods in a class) and (2) the cohesive interaction definition as declared by the metric under consideration.

As shown in Table 1, there are four cases for which certain cohesion metrics have undefined values. The first case occurs when a class that has attributes does not have methods. Our interpretation for this case is that we would expect all the attributes to describe the features of the same object. Therefore, the class is expected to be fully cohesive (i.e., the cohesion must be at the maximum and the lack of cohesion must be at the minimum). Conversely, if the attributes represent unrelated global variables for the whole program, the cohesion of the class must be at the minimum. However, by manual inspection, we found that this situation does not occur in any of the classes of the systems considered in this paper, which indicates that it occurs rarely and that, therefore, the assigned cohesion value should not be based on it.

The general definition of a cohesive module (i.e., a class in object-oriented software) is that it performs a single task and cannot be easily split (Bieman and Ott 1994). A class that has a single method and no attributes (the second case in Table 1) is associated with a single task, that is, the task performed by its method, and cannot be easily split. Therefore, the cohesion value for such a class must be the maximum (the lack of cohesion must be the minimum). However, when the class with a single method has one or more attributes, the intuition becomes dependent on the definition of the cohesive interaction as specified by the metric under consideration. If the cohesive interaction definition is based on the relationships between methods, the cohesion must be the maximum, similar to the previous case, because the class cannot be easily split.
However, if the cohesive interaction definition is based on the relationship between the method and the attributes, the cohesion value depends on the (relative) number of attributes referenced by the method.

Similarly, if a class with several methods does not have any attributes, the cohesion depends on the cohesive interaction between methods specified by the metric. If the metric considers only the relationships between methods through attributes, the cohesion value must be the minimum, because the methods are unrelated. However, if the metric definition also considers the relationships between methods through method invocations, the cohesion value depends on the (relative) number of methods that are related through invocations. If the metric does not account for relationships through invocations, the methods will be considered unrelated and the cohesion will be the minimum.

We applied these criteria to assign values for the metrics under consideration for all the undefined cases. This paper does not consider the case in which a class has neither attributes nor methods, because such a class is not frequently declared and used in object-oriented systems.

3.1. Method-Method Interaction (MMI) Metrics

TCC, LCC, DC_D, DC_I, CC, SCOM, LCOM1, and LCOM2 are MMI metrics because they focus on measuring the interactions between each pair of methods. These metrics have the maximum cohesion value of 1 (value of 0 for the lack of cohesion metrics LCOM1 and LCOM2), and they are undefined for any classes that have fewer than two methods. In addition, CC and SCOM are normalized by dividing the degree of interactions between methods by a factor that depends on the number of attributes. If the class does not have attributes, the values of these metrics are undefined (the cohesion value cannot be infinity). The detailed undefined cases for these metrics are shown in the second, third, fourth, and fifth rows in Table 1 (the header row is not counted).

A. Cohesion of a Class with No Methods

According to Criterion 2, when the class does not have any methods, the class cohesion value as defined by the MMI metrics must be 1. In this case, the assigned value remains in the range of values that do not cause the metrics to violate the non-negativity and normalization property. In addition, the class has no cohesive interactions, which is at the same time the highest possible number of cohesive interactions. Therefore, the assigned value of 1 does not cause the metrics to violate the null value and maximum value property. The assigned value does not cause the metrics to violate the monotonicity property because no cohesive interactions can be added to the class and, therefore, the cohesion cannot be decreased. Finally, the assigned value does not cause the metrics to violate the cohesive modules property: merging the class under consideration with any other class cannot cause the merged class to assume a cohesion value higher than that of the class under consideration, because the assigned value of 1 is the highest possible cohesion value. This discussion demonstrates that the assigned value satisfies Criterion 1. In addition, the assigned value satisfies Criterion 2 because it represents the maximum possible value, as discussed earlier.
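The values assigned to the MMI metrics can be summarized as a small lookup, given here as a hedged sketch and not as the authors' tooling. The function name `assigned_mmi_value` and its parameters are hypothetical. It encodes the no-method case argued above together with the single-method case and the CC/SCOM no-attribute case that this section goes on to justify in its subsections B and C. It returns `None` when the original metric definition already applies.

```python
def assigned_mmi_value(num_methods, num_attributes,
                       inverse=False, attribute_normalized=False):
    """Return the value assigned to an MMI metric for a special-case class.

    inverse: True for lack-of-cohesion metrics (LCOM1, LCOM2), whose best value is 0.
    attribute_normalized: True for CC and SCOM, which are undefined without attributes.
    Returns None when the metric's original definition covers the class.
    """
    if num_methods == 0 and num_attributes == 0:
        # The paper explicitly leaves this case out of consideration.
        raise ValueError("classes with neither methods nor attributes are not considered")

    best = 0.0 if inverse else 1.0  # maximum cohesion, or minimum lack of cohesion

    if num_methods < 2:
        # No methods or a single method: no pair of methods exists to interact.
        return best

    if num_attributes == 0 and attribute_normalized:
        # Multiple methods and no attributes: CC and SCOM take the minimum cohesion.
        return 0.0

    return None  # defer to the metric's original definition


# A data-only class scores 1 under TCC-like metrics and 0 under LCOM1/LCOM2.
tcc_like = assigned_mmi_value(num_methods=0, num_attributes=3)
lcom_like = assigned_mmi_value(num_methods=0, num_attributes=3, inverse=True)
```

Returning `None` for the defined cases keeps the sketch from restating each metric's formula. A caller would fall back to the metric's own computation whenever no assigned value applies.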
The same argument applies to both LCOM1 and LCOM2, but we recall that these metrics measure the lack of cohesion, and therefore, the opposite conclusions (e.g., regarding the maximum and minimum values) are drawn. For example, the lack of cohesion value for the class under consideration is 0. In addition, when the class under consideration is merged with another class c, the lack of cohesion value of the resulting class cannot be less than that for the class under consideration, because the class under consideration exhibits the minimum possible lack of cohesion value.

B. Cohesion of a Class with a Single Method

Based on Criterion 2, when a class has a single method (the second and third cases in Table 1), the cohesion value as defined by the MMI metrics must be 1 (0 for the lack of cohesion metrics). In this case, the assigned value remains within the range of values that do not cause the metrics to violate the non-negativity and normalization property. Although the class has no cohesive interactions, the class features the maximum possible number of cohesive interactions, which is zero, because the single method has no other methods with which it can interact. As a result, the assigned value does not cause the metrics to violate the null value and maximum value property. The assigned value does not cause the metrics to violate the monotonicity property because no cohesive interactions can be added to a class that has a single method. Finally, the assigned value does not cause the metrics to violate the cohesive modules property because merging such a class with any other class cannot cause the merged class to assume a cohesion value higher than that of the class under consideration, which already has the highest possible cohesion value. As a result, the assigned value satisfies Criterion 1, and its assignment is based on Criterion 2.

C. Cohesion of a Class with Multiple Methods and No Attributes

A class with multiple methods and no attributes has defined values for all the considered MMI metrics except CC and SCOM. According to Criterion 2, the CC and SCOM values must be 0. As a result, the assigned value remains in the range of values that do not cause the metrics to violate the non-negativity and normalization property. In addition, the class has no cohesive interactions and, therefore, the assigned value 0 does not cause the metrics to violate the null value and maximum value property. The assigned value does not cause the metrics to violate the monotonicity property because the assigned value is the minimum possible value and, therefore, adding a cohesive interaction to the class cannot decrease the cohesion value. When the class under consideration is merged with another class c, the resulting class will have the same number of attributes and similarities between each pair of methods as class c and more methods than class c. CC and SCOM values are inversely proportional to the number of methods; thus, the resulting class will have CC and SCOM values less than those for class c, which verifies that the assigned value does not lead to a violation of the cohesive modules property. As a result, the assigned value satisfies Criterion 1 and it is assigned according to Criterion 2.

3.2 LCOM5

LCOM5 is a lack of cohesion metric and its value ranges within the interval [0, m/(m-1)], where m is the number of methods in a class (i.e., the maximum possible lack of cohesion value is 2 when m = 2). This metric is originally undefined for all of the four special cases, as shown in Table 1, and our proposed values for these cases are as follows.

A. Cohesion of a Class with No Methods

According to Criterion 2, when the class does not have methods, the lack of cohesion value must be the minimum, which is 0. In this case, the assigned value remains in the range of values that do not cause LCOM5 to violate the non-negativity and normalization property. In addition, the class cannot have cohesive interactions and, therefore, the assigned value 0 does not cause LCOM5 to violate the null value and maximum value property. The assigned value does not cause LCOM5 to violate the monotonicity property because no cohesive interactions can be added to the class and, therefore, the lack of cohesion value cannot be increased. When the class under consideration is merged with another class, the resulting class cannot have an LCOM5 value less than that for the class under consideration, because the class under consideration has the minimum possible lack of cohesion value. Thus, the cohesive modules property is satisfied. As a result, the assigned value does not cause LCOM5 to violate Criterion 1 and it is assigned according to Criterion 2.

B. Cohesion of a Class with a Single Method and No Attributes

Based on Criterion 2, when a class with a single method does not have attributes, its LCOM5 value must be 0 (i.e., the minimum possible lack of cohesion value). Using the same arguments as given in Section 3.1.B, the assigned value does not cause LCOM5 to violate any of the non-negativity and normalization, null value and maximum value, or monotonicity properties. Finally, the assigned value does not cause LCOM5 to violate the cohesive modules property because merging such a class with any other unrelated class cannot cause the merged class to assume a lack of cohesion value that is lower than that of the class under consideration, which already has the lowest possible lack of cohesion value. This confirms that the assigned value satisfies Criterion 1. In addition, the assigned value satisfies Criterion 2 as it represents the minimum possible lack of cohesion value, as discussed previously.

C. Cohesion of a Class with a Single Method and One or More Attributes

When a class with a single method has one or more attributes, its LCOM5 value depends on the number of attributes referenced by the method. LCOM5 measures lack of cohesion, it has a maximum value of 2, and it considers the relative number of attributes referenced by the methods. Accordingly, the assigned LCOM5 value in this case is 2(1 - α/a), where α is the number of attributes referenced by the method and a is the number of attributes in the class. This value remains in the range of values [0, 2] that do not cause LCOM5 to violate the non-negativity and normalization property. In addition, if the class has all possible cohesive interactions, then the lack of cohesion value will be minimum, and if the class does not have any cohesive interactions, then the lack of cohesion value will be maximum. Thus, the assigned value does not cause LCOM5 to violate the null value and maximum value property. When a cohesive interaction is added to the class, the LCOM5 value becomes smaller, which does not violate the monotonicity property for lack of cohesion metrics. LCOM5 already violates the cohesive modules property (Al Dallal 2009a).
When the class under consideration is merged with another unrelated class c, to satisfy the cohesive modules property, the merged class must have a lack of cohesion value greater than or equal to the lack of cohesion value for class c, which is not always true. For example, if class c has two methods and any number of attributes such that none of the attributes is referenced by the methods, then the LCOM5 value for class c will be 2. If this class is merged with the class under consideration, the merged class has three methods, so its maximum possible LCOM5 value will be m/(m-1) = 3/2 = 1.5, which is less than 2 and causes LCOM5 to violate the cohesive modules property. This case is cited as an exception to Criterion 1. Considering the other properties, the assigned value satisfies Criterion 1. In addition, the assigned value satisfies Criterion 2 as it represents the relative number of attributes referenced by the methods of the class.

D. Cohesion of a Class with Multiple Methods and No Attributes

Based on Criterion 2, a class with multiple methods and no attributes must have the maximum possible lack of cohesion value, which is equal to m/(m-1) for LCOM5, where m is the number of methods. In this case, the assigned value remains in the range of values that do not cause LCOM5 to violate the non-negativity and normalization property. In addition, the class has no cohesive interactions and, therefore, the assigned value does not cause LCOM5 to violate the null value and maximum value property. The assigned value does not cause LCOM5 to violate the monotonicity property because the assigned value is the maximum possible lack of cohesion value. Therefore, adding a cohesive interaction to the class cannot increase the lack of cohesion value. The same argument and example given in Section 3.2.C regarding the cohesive modules property apply here, indicating that the assigned value cannot satisfy the cohesive modules property because LCOM5 itself violates the property. Except for this property, the assigned value satisfies Criterion 1. In addition, the assigned value satisfies Criterion 2 as it represents the maximum possible lack of cohesion value.

3.3 Coh

Coh accounts for the relative number of attributes referenced by methods, its value ranges within the interval [0, 1], and it is undefined for classes that do not include methods or attributes, as shown in Table 1. Our proposed values for these cases are as follows.

A. Cohesion of a Class with No Methods

According to Criterion 2, when the class does not have any methods, the cohesion value must be the maximum, which is 1. Using the same arguments as given in Section 3.1.A, the assigned value does not cause Coh to violate any of the non-negativity and normalization, null value and maximum value, monotonicity, or cohesive modules properties. As a result, the assigned value satisfies Criterion 1 and it is assigned according to Criterion 2.

B. Cohesion of a Class with a Single Method and No Attributes

Based on Criterion 2, when a class with a single method does not have any attributes, its Coh value must be 1 (i.e., the maximum possible cohesion value). This value satisfies Criteria 1 and 2 for the same reasons as discussed above in Section 3.1.B.

C. Cohesion of a Class with Multiple Methods and No Attributes

According to Criterion 2, a class with multiple methods and no attributes must have the minimum possible cohesion value, which is 0. This value does not cause Coh to violate any of the non-negativity and normalization, null value and maximum value, or monotonicity properties for the same reasons discussed above in Section 3.1.C.
When the class under consideration is merged with another class c, the resulting class will have more methods than class c and the same number of attributes and number of attributes referenced by methods as class c. The value of Coh is inversely proportional to the number of methods; thus, the resulting class will have a Coh value less than that for class c, which verifies that the assigned value does not lead to a violation of the cohesive modules property. As a result, the assigned value satisfies Criterion 1 and it is assigned according to Criterion 2.

3.4 CAMC

CAMC considers the relative number of distinct types used by the parameters of the methods, its value ranges within the interval [0, 1], and it is undefined for classes that do not include methods or parameters, as shown in Table 1. Our proposed values for these cases are as follows.

A. Cohesion of a Class with No Methods

According to Criterion 2, when the class does not have methods, the cohesion value must be the maximum, which is 1. Using the same arguments given in Section 3.1.A, the assigned value does not cause CAMC to violate any of the non-negativity and normalization, null value and maximum value, monotonicity, or cohesive modules properties, and therefore, the assigned value satisfies Criterion 1. In addition, the assigned value satisfies Criterion 2 as it represents the maximum possible cohesion value, as discussed earlier.

B. Cohesion of a Class with a Single Method and No Parameters

Based on Criterion 2, when a class has a single method that does not have parameters, its CAMC value must be 1 (i.e., the maximum possible cohesion value). This value satisfies Criteria 1 and 2 for the same reasons as discussed above in Section 3.1.B.

C. Cohesion of a Class with Multiple Methods and No Parameters

According to Criterion 2, a class with multiple methods, of which none have parameters, must have the minimum possible cohesion value, which is 0. This value does not cause CAMC to violate any of the non-negativity and normalization, null value and maximum value, or monotonicity properties for the same reasons as discussed above in Section 3.1.C. When the class under consideration is merged with another class c, the resulting class will have the same number of distinct parameter types and number of types of parameters of each method as in class c, and more methods than class c. The value of CAMC is inversely proportional to the number of methods; thus, the resulting class will have a CAMC value less than that for class c, which confirms that the assigned value does not lead to a violation of the cohesive modules property.
As a result, the assigned value satisfies Criterion 1 and it is assigned according to Criterion 2.

3.5 NHD

NHD is a lack of cohesion metric based on measuring the agreement, in terms of sharing either the presence or absence of parameter types, between each pair of methods. Its value ranges within the interval [0, 1], and it is undefined for any classes that have fewer than two methods. In addition, NHD is normalized by dividing the degree of agreements between methods by a factor that depends on the number of parameter types in the class. If none of the methods in the class have parameters, the metric is undefined (the value cannot be infinity). Our proposed values for these cases are as follows.

A. Cohesion of a Class with No Methods

According to Criterion 2, when a class does not have methods, the lack of cohesion value using NHD for the class must be 0. Using the same arguments as outlined in Section 3.2.A, the assigned value does not cause the metric to violate any of the non-negativity and normalization, null value and maximum value, monotonicity, or cohesive modules properties. As a result, the assigned value does not violate Criterion 1 and it is assigned according to Criterion 2.

B. Cohesion of a Class with a Single Method

Based on Criterion 2, when a class has a single method with or without parameters, its NHD value must be 0 (i.e., the minimum possible lack of cohesion value). This value satisfies Criteria 1 and 2 for the same reasons as discussed above in Section 3.1.B.

C. Cohesion of a Class with Multiple Methods and No Parameters

Referring to Criterion 2, a class with multiple methods, of which none have parameters, must have the maximum possible lack of cohesion value, which is 1. This value does not cause NHD to violate any of the non-negativity and normalization, null value and maximum value, or monotonicity properties for the same reasons as discussed above in Section 3.2.D. As in the discussion in Section 3.2.C, the assigned value fails to satisfy the cohesive modules property because this property depends on the NHD values that the formula of the metric assigns to the other classes that can be merged with the class under consideration, and the formula itself already violates the cohesive modules property (Al Dallal 2010a). In the context of the other properties, the assigned value satisfies Criterion 1. In addition, the assigned value satisfies Criterion 2 as it represents the maximum possible value for the lack of cohesion, as discussed earlier.

The assigned metric values for the four special cases under consideration are summarized in Table 2. All these values satisfy both Criteria 1 and 2, with the exceptions stated for Criterion 1. These assigned values allow the metrics to be applicable for any class. The next section investigates whether the assigned values improve the ability of the metrics to indicate class quality in terms of fault prediction more precisely than before the value assignment.
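The assigned values described above for LCOM5 and Coh can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: it assumes the common formulations LCOM5 = (m - s/a)/(m - 1) and Coh = s/(m·a), where s is the total number of attribute references summed over the methods, and the function names and the list-of-sets input format are hypothetical.

```python
def lcom5(method_attrs, num_attrs):
    """LCOM5 (lack of cohesion) extended with the assigned special-case values.

    method_attrs: list of sets, one per method, holding referenced attribute names
    (hypothetical input format). num_attrs: number of attributes in the class.
    """
    m = len(method_attrs)
    refs = sum(len(attrs) for attrs in method_attrs)  # s in the assumed formula
    if m == 0:
        return 0.0  # no methods: minimum lack of cohesion
    if m == 1:
        if num_attrs == 0:
            return 0.0  # single method, no attributes: minimum lack of cohesion
        return 2.0 * (1.0 - refs / num_attrs)  # assigned value 2(1 - alpha/a)
    if num_attrs == 0:
        return m / (m - 1)  # multiple methods, no attributes: maximum value
    return (m - refs / num_attrs) / (m - 1)  # assumed original formula


def coh(method_attrs, num_attrs):
    """Coh (cohesion) extended with the assigned special-case values."""
    m = len(method_attrs)
    if m == 0:
        return 1.0  # no methods: maximum cohesion
    if num_attrs == 0:
        return 1.0 if m == 1 else 0.0  # single method: 1; multiple methods: 0
    refs = sum(len(attrs) for attrs in method_attrs)
    return refs / (m * num_attrs)  # assumed original formula
```

For instance, lcom5([set(), set()], 2) gives 2 for a two-method class whose attributes are never referenced, whereas lcom5([set(), set(), set()], 3) gives 1.5 for the three-method class obtained by merging it with a single-method class, which reproduces the cohesive modules counterexample of Section 3.2.C.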
Metric | m=0 and a>0 | m=1 and a=0 | m=1 and a>0 | m>1 and a=0
TCC, LCC, DC_D, DC_I | 1 | 1 | 1 | Assigned before
CC, SCOM, CAMC | 1 | 1 | 1 | 0
LCOM1, LCOM2 | 0 | 0 | 0 | Assigned before
LCOM5 | 0 | 0 | 2*(1-α/a), where α is the summation of the number of distinct attributes accessed by each method in a class | m/(m-1)
Coh | 1 | 1 | Assigned before | 0
NHD | 0 | 0 | 0 | 1

Table 2: The assigned values for the metrics under consideration

4. Empirical Validation

We present four analyses. The first explores the applicability of the metrics before assigning values to those cases with undefined values, where the applicability of a metric is defined as the percentage of classes for which the metric has defined values. The goal of this analysis is to determine the percentages of classes that have undefined values in selected systems, thus empirically confirming the importance of this research. In addition, the results of this analysis are used to compare the metrics under consideration in terms of their applicability.

The second analysis investigates the correlations between the metrics and compares their strengths in detecting faulty classes. This analysis is only applied to the classes that have defined values using all the metrics taken into consideration. After assigning values, as proposed in this paper, to the classes that previously had undefined cohesion values, we explore whether adding the classes that previously had undefined values may improve the results of our empirical study.

The third analysis applies to each metric and investigates the strength of each metric in detecting faulty classes, taking into account only those classes that previously had undefined values for that metric. These classes are assigned cohesion values based on what is proposed in this paper. The goal of this analysis is to empirically evaluate the correctness of our proposed cohesion values for the undefined cases.

The fourth analysis is similar to the second but is applied to all classes, including those that previously had undefined values. The goal of this analysis is to compare its outputs with the results of the second analysis and thereby to explore the effect of taking all of the classes into account, including those that previously had undefined cohesion values, on the correlation and faulty-class prediction results.

In the last three analyses, we rely on the widely accepted assumption, used in many studies (e.g., Briand et al. 1998, Briand et al. 2001, Gyimothy et al. 2005, Aggarwal et al. 2007, and Marcus et al. 2008), that any metric that predicts faulty classes more precisely than others should be a better cohesion indicator. Therefore, we use the strength of the metric in predicting faulty classes as an indirect indicator for the strength of the metric in indicating class cohesion. Developers are expected to prefer applying cohesion metrics that better indicate cohesion over those that are worse at indicating cohesion.
Building a fault prediction model is not one of our empirical study goals, because that would require considering other factors such as size, coupling, and complexity, all of which are out of the scope of this paper.

4.1 Software Systems and Metrics

We chose four Java open-source software systems from different domains: Art of Illusion v.2.5 (Illusion 2009), GanttProject v (GanttProject 2009), JabRef v.2.3 beta 2 (JabRef 2009), and Openbravo v (Openbravo 2009). Art of Illusion consists of 481 classes and about 88 thousand lines of code (KLOC) and is a 3D modeling, rendering, and animation studio system. GanttProject consists of 468 classes and about 39 KLOC, and is a project scheduling application featuring resource management, calendaring, and import/export of several formats (MS Project, HTML, PDF, spreadsheets). JabRef consists of 569 classes and about 48 KLOC and is a graphical application for managing bibliographical databases. Openbravo consists of 447 classes and about 36 KLOC and is a point-of-sale application designed for touchscreens. We chose these four open-source systems from The restrictions taken into account in choosing these systems were that they (1) are implemented using Java, (2) are of medium size in terms of the number of classes, (3) are from different domains, and (4) have available source code and fault repositories.
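The applicability measure used in the first analysis (the percentage of classes for which a metric has a defined value) can be illustrated with a short Python sketch. The function name and the convention of marking an undefined value with None are assumptions made for this example and do not describe the tooling used in the study.

```python
def applicability(values):
    """Return the percentage of classes for which a metric value is defined.

    values: one entry per class; None marks an undefined metric value
    (a convention assumed for this sketch).
    """
    if not values:
        raise ValueError("at least one class is required")
    defined = sum(1 for value in values if value is not None)
    return 100.0 * defined / len(values)
```

For example, applicability([0.5, None, 1.0, None]) returns 50.0, because the metric is defined for two of the four classes.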
