Improving Origin Analysis with Weighting Functions

Lin Yang, Anwar Haque and Xin Zhan
Supervisor: Michael Godfrey
University of Waterloo

1 Introduction

Software systems must undergo modifications to improve their readability, simplify their structure, or respond to changes in user requests and the software environment [5]. These activities involve renaming, moving, splitting and merging source code entities. Consequently, many entities that appear new in a later release are actually transformed from old entities. There exist several previous works [2, 3, 4] on this subject, which is termed Origin Analysis. Godfrey et al. [3, 4] proposed algorithms to find entity renames, splits and merges across releases by analyzing the similarity of call relations as well as various attributes of the program entities. In their approach, each caller or callee function is treated with equal importance. However, some functions carry more weight than others, as illustrated below.

Consider three functions A, B and C with 4, 2 and 2 callees respectively, as shown in Figure 1: A calls F1, F2, F3 and F4, B calls F1 and F2, and C calls F3 and F4. A traditional call relation matcher treats each callee equally, so the similarity of A and B is calculated to be the same as that of A and C. However, with more information about the callee functions available, a better decision is possible. For example, if F1 and F2 are standard library functions while F3 and F4 are functions defined in the same file as the caller, then A and C are more likely to be matching functions than A and B. Similarly, if F1 and F2 are called 100 times in the system while F3 and F4 are called only 5 times, then A and C are more likely to be matching functions than A and B. In order to capture such differences, each caller and callee function should be assigned a suitable weight rather than being treated equally. A small worked example of this effect is sketched at the end of this section.

Figure 1: An example of call relations.

In this paper, we have designed two weighting schemes: hierarchy based and frequency based. We have also proposed an automatic approach to origin analysis based on machine learning techniques. Unlike traditional work, our approach does not require human input in picking weights and thresholds. The case study we conducted on Ctags demonstrates the effectiveness of our approach.
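To make the Figure 1 example concrete, here is a minimal sketch (our illustration, not the report's implementation: the 0.25 library-call weight anticipates the inverse hierarchy function of Section 2.2, and the overlap measure is a simplified stand-in for the similarity formulas of Section 3.3):

# Callee sets from Figure 1: A calls all four functions, B calls the two
# library functions, C calls the two same-file functions.
callees = {
    "A": {"F1", "F2", "F3", "F4"},
    "B": {"F1", "F2"},
    "C": {"F3", "F4"},
}

def weighted_overlap(f, g, weight):
    """Weight of shared callees over weight of all callees involved."""
    shared = callees[f] & callees[g]
    union = callees[f] | callees[g]
    return sum(weight[x] for x in shared) / sum(weight[x] for x in union)

uniform = {"F1": 1.0, "F2": 1.0, "F3": 1.0, "F4": 1.0}
hierarchy = {"F1": 0.25, "F2": 0.25, "F3": 1.0, "F4": 1.0}  # library vs. same file

print(weighted_overlap("A", "B", uniform), weighted_overlap("A", "C", uniform))
# -> 0.5 0.5: unweighted matching cannot tell the two candidates apart
print(weighted_overlap("A", "B", hierarchy), weighted_overlap("A", "C", hierarchy))
# -> 0.2 0.8: weighting makes A-C the clearly better match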

Our contributions are as follows:

- We have developed two weighting schemes to measure call relation similarity: frequency based and hierarchy based.
- We have designed and implemented a system that is flexible enough to be used as a platform for function-based Origin Analysis.
- We have carried out a case study on Ctags. The experimental results show that our approach achieves higher accuracy than unweighted call relation analysis.
- We provide a mathematically grounded methodology for comparing the usefulness of attributes in Origin Analysis.
- We establish an automatic function match identification framework based on decision tree learning.

The remainder of this report is organized as follows: Section 2 defines our weighting models. Section 3 presents our origin analysis system. Section 4 introduces the relevant AI techniques and the experiment methodology. Sections 5 and 6 describe the case study and its results. The prediction platform based on our system is briefly discussed in Section 7. We conclude and discuss future work in Section 8.

2 Weighting Functions

The design of the weighting functions is explained in this section. The two categories are frequency based and hierarchy based.

2.1 Frequency Based Weighting Functions

Intuitively, if a function has many callees, then being called by this function carries a low weight. On the other hand, if the function makes only one call, then that one call carries a high weight. To implement this, the algorithm adjusts the weight of a function as a caller or as a callee according to the size of its callee set or caller set, respectively. A number of monotonically decreasing functions with different decay speeds were selected to represent the frequency based category. As shown in Figure 2, the weight of a function called 10 times carries roughly 10% to 80% of the weight of a function called 1 time, depending on the weighting function used.

2.2 Hierarchy Based Weighting Functions

The other category of weighting functions is hierarchy based. The idea is to assign weight according to the distance between the caller and callee. For example, it is desirable that library calls carry less weight than function calls within the same file. The question is how small the weight of a library call should be in comparison to a same-file function call. Instead of hand-picking the values, we give each call type a code value and then use mathematical functions to calculate the weights from it. The code value of Same File is 1, Same Directory 2, Same System 3 and Library Call 4. As shown in Figure 3, using the inverse function, a library call carries only 25% of the weight of a function call made within the same file. A sketch of both schemes follows.
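As an illustration, here is a minimal sketch of the two weighting categories. The inverse function 1/x and the logarithmic variant 1/(1 + ln x) are our assumed representatives of the frequency based shapes; the exact set of functions evaluated in the report is shown in Figures 2 and 3.

import math

# Frequency based: weight decays with the number of calls a function
# makes (as a caller) or receives (as a callee).
def freq_weight_inverse(n_calls):
    return 1.0 / n_calls

def freq_weight_log(n_calls):
    return 1.0 / (1.0 + math.log(n_calls))

# Hierarchy based: weight derived from a code value for the caller/callee
# distance: Same File = 1, Same Directory = 2, Same System = 3, Library = 4.
HIERARCHY_CODE = {"same_file": 1, "same_dir": 2, "same_system": 3, "library": 4}

def hierarchy_weight_inverse(call_type):
    return 1.0 / HIERARCHY_CODE[call_type]

# With the inverse function, a library call carries 1/4 = 25% of the
# weight of a same-file call, matching Figure 3.
assert hierarchy_weight_inverse("library") == 0.25 * hierarchy_weight_inverse("same_file")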

Figure 2: Plots of the frequency based weighting functions.

Figure 3: Plots of the hierarchy based weighting functions.

3 Function Based Origin Analysis System

There are three major components in the system, categorized by functionality: the Fact Extractor, the Data Analyzer and the Attribute Learner. Figure 4 shows the data flow, illustrating how source code is turned into the final result indicating which weighting function or attribute is better. The entire process, other than the human validation step, is automatic.

3.1 Fact Extractor

Before any analysis can take place, fact extraction must be done to prepare the data. This part is mostly handled by SWAG Kit and Beagle, two tools developed by the Software Architecture Group at the University of Waterloo. The goal here is to obtain abstract information about the subject system, along with the functions whose name and location were not changed. First, source code files from the two versions of the subject system are parsed using the SWAG Kit extractor. There are three steps in the extractor pipeline:

- cppx: Extract the facts. Produces *.ta files from the original source files.
- prep: Prepare the facts. Produces *.o.ta files from the extracted facts.
- linkplus: Link the facts. Produces out.ln.ta from the *.o.ta files.

Once the basic entity extraction is done, the next step is preparing evolution facts from the SWAG Kit output using the evprep command of Beagle. The output file out.ev.ta contains facts about call relations, system structure and source information. The final step is loading the facts into the Beagle database.

Figure 4: Data flow and system overview. Rectangles represent processors while diamonds represent data types. Blue marks internal systems and green marks external systems. Orange is data in human-readable format while red is data objects. The purple rectangles are the various implementations of weighting functions, which extend OAFunction.

3.2 Data Analyzer

The task of the Data Analyzer is to calculate the attributes of each function pair using the abstract facts prepared by the Fact Extractor and the algorithm specified by the user. First, the data generated by SWAG Kit and Beagle is fed to the Input Reader module to build the data structures. OASystem contains the information of the system version it represents: it knows how many functions the system contains, what file and subsystem each function belongs to, and so on. OAMapping is the data structure that links already matched functions between the two versions; it is implemented as a hashtable so that each query takes O(1) time. OAFunction is an abstract class that forms the basic foundation of the function entity it represents; the vital method that calculates the weight of that function is implemented by each individual class that extends it. After the abstract systems are built, the algorithm and weighting function specified by the user are run to calculate the desired attributes. This process essentially takes each function pair, looks at their information and fills in the attributes. Sample attributes include Overall Similarity, Caller Set Similarity and Callee Set Similarity.
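As a rough illustration of this design, the core data structures might look like the following. This is a hypothetical Python rendering; the actual classes and signatures are not shown in the report.

from abc import ABC, abstractmethod

class OASystem:
    """One version of the subject system: functions plus their locations."""
    def __init__(self, functions):
        self.functions = functions          # name -> OAFunction
    def location_of(self, name):
        return self.functions[name].file

class OAMapping:
    """Links already-matched functions between versions; dict gives O(1) lookup."""
    def __init__(self):
        self._pairs = {}                    # version-1 name -> version-2 name
    def add(self, old, new):
        self._pairs[old] = new
    def match_of(self, old):
        return self._pairs.get(old)

class OAFunction(ABC):
    """Base function entity; each weighting scheme overrides the weight methods."""
    def __init__(self, name, file, callers, callees):
        self.name, self.file = name, file
        self.callers, self.callees = callers, callees
    @abstractmethod
    def weight_as_caller(self, other):
        ...
    @abstractmethod
    def weight_as_callee(self, other):
        ...

class FrequencyWeightedFunction(OAFunction):
    """Inverse-frequency weighting: many outgoing calls -> each call weighs less."""
    def weight_as_caller(self, other):
        return 1.0 / max(1, len(self.callees))
    def weight_as_callee(self, other):
        return 1.0 / max(1, len(self.callers))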

3.3 Similarity Calculation

In order for the similarity value to carry the same weight regardless of the size of the subject system, a relative similarity value rather than an absolute one is desirable. By using a similarity value between 0 and 1, case-dependent threshold tuning is avoided. The Overall Similarity between two functions is calculated as:

OverallSim(f_1, f_2) = \frac{MatchingWeightCaller(Caller(f_1), Caller(f_2), f_1, f_2) + MatchingWeightCallee(Callee(f_1), Callee(f_2), f_1, f_2)}{TotalWeight(Caller(f_1), Caller(f_2), f_1, f_2) + TotalWeight(Callee(f_1), Callee(f_2), f_1, f_2)}

Caller Set Similarity is:

CallerSim(f_1, f_2) = \frac{MatchingWeightCaller(Caller(f_1), Caller(f_2), f_1, f_2)}{TotalWeight(Caller(f_1), Caller(f_2), f_1, f_2)}

Callee Set Similarity is:

CalleeSim(f_1, f_2) = \frac{MatchingWeightCallee(Callee(f_1), Callee(f_2), f_1, f_2)}{TotalWeight(Callee(f_1), Callee(f_2), f_1, f_2)}

where, with MS_1 \subseteq set_1 and MS_2 \subseteq set_2 denoting the matched subsets of the two sets,

MatchingWeightCaller(set_1, set_2, f_1, f_2) = \sum_{i \in MS_1} WeightAsCallee(i, f_1) + \sum_{j \in MS_2} WeightAsCallee(j, f_2)

MatchingWeightCallee(set_1, set_2, f_1, f_2) = \sum_{i \in MS_1} WeightAsCaller(i, f_1) + \sum_{j \in MS_2} WeightAsCaller(j, f_2)
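A minimal sketch of this calculation follows, reusing the hypothetical classes from the Section 3.2 illustration. Two assumptions on our part: TotalWeight sums the weights of all members of both sets (the report does not spell out its definition), and a single one-argument weight function stands in for the two-argument WeightAsCaller/WeightAsCallee notation.

def matching_weight(set1, set2, mapping, w):
    """Summed weight of matched members on both sides.
    mapping: dict from version-1 names to version-2 names (cf. OAMapping)."""
    matched = [i for i in set1 if mapping.get(i) in set2]
    return sum(w(i) for i in matched) + sum(w(mapping[i]) for i in matched)

def total_weight(set1, set2, w):
    """Assumed normalizer: summed weight of all members of both sets."""
    return sum(w(i) for i in set1) + sum(w(j) for j in set2)

def caller_sim(f1, f2, mapping, w_as_callee):
    # Members of the caller sets are weighted in their relation to f as
    # the callee (WeightAsCallee in the report's notation).
    t = total_weight(f1.callers, f2.callers, w_as_callee)
    return matching_weight(f1.callers, f2.callers, mapping, w_as_callee) / t if t else 0.0

def callee_sim(f1, f2, mapping, w_as_caller):
    t = total_weight(f1.callees, f2.callees, w_as_caller)
    return matching_weight(f1.callees, f2.callees, mapping, w_as_caller) / t if t else 0.0

def overall_sim(f1, f2, mapping, w_as_callee, w_as_caller):
    """Matched caller and callee weight over the total weight of all four sets."""
    m = (matching_weight(f1.callers, f2.callers, mapping, w_as_callee)
         + matching_weight(f1.callees, f2.callees, mapping, w_as_caller))
    t = (total_weight(f1.callers, f2.callers, w_as_callee)
         + total_weight(f1.callees, f2.callees, w_as_caller))
    return m / t if t else 0.0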

3.4 System Output

The output of the function analysis system is a text file containing the user-requested attributes. In Figure 5, the first three columns are the similarity values calculated using the log based weighting function.

3.5 Attribute Learner

After the attribute table is filled, the next step is to determine which of these attributes is a better indicator of a function match. The mechanism is explained in detail in the next section, Experiment Methodology.

Figure 5: Sample output of the training data. Each row lists three similarity values (Log_Total, Log_Caller, Log_Callee) computed by the function relation analysis using the log based weighting function, and a final Result column giving the TRUE/FALSE match label from human validation.

4 Experimental Methodology

In this section, the experiment methodology is explained in detail. We decided against using the precision and recall of a hand-picked threshold to measure performance, for the following reasons:

- It is not automatic. If the samples are changed because of an error in human validation, or more samples are added to the experiment, the whole process of trying various thresholds and calculating precision and recall has to be redone manually.
- It is not objective. We could try many thresholds for our weighted function analysis and pick the best of them, while choosing a suboptimal threshold for the uniform function analysis. The performance gain in that case could be substantial, but it would not reflect the truth, making the result much less credible.
- It is impossible to calculate the real recall value. While precision can be calculated exactly by going through the identified functions one by one, there is practically no way to calculate the real recall value: that would require finding all matching pairs in the system, and there could easily be hundreds of thousands of pairs. Some work [1] has used techniques to obtain a pseudo recall value, but there is no guarantee on how close the pseudo recall is to the actual recall, making such results unreliable.

Based on these arguments, we decided to use a fully objective and automatic machine learning approach that gives a quantitative measure of the performance of the various weighting functions. The technique rests on information theory, which is the basis of decision tree learning. Decision tree learning is one of the simplest, and yet most successful, forms of learning algorithm. A decision tree takes as input an object or situation described by a set of attributes and returns the predicted output value for the input [6]. The correctness of a decision tree depends on the choice of the attribute tests.

The goal of our verification process is to determine whether one attribute is a better indicator of function matching than the others. This is essentially to find a formal measure of the usefulness of attributes, which is the same goal as in decision tree learning. The measure should have its maximum value when the attribute is perfect, meaning it divides all matching pairs from non-matching pairs, and its minimum value when the attribute is of no use at all. One suitable measure is the expected amount of information provided by the attribute, where the term is used in the mathematical sense first defined by Shannon and Weaver [6]. The information content of learning the answer to a question with possible values v_i of probability P(v_i) is defined as

I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)

In practice, given a training set containing p positive examples and n negative examples, the estimate of the information contained is

I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}

The information gain from an attribute test A is the difference between the original information requirement and the new requirement:

Gain(A) = I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - Remainder(A)

where, if A splits the training set into subsets E_1, \ldots, E_v containing p_i positive and n_i negative examples each,

Remainder(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \, I\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)

The attribute with the highest information gain is the best classifier and, in our case, the best indicator of function matching.
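To make the selection criterion concrete, here is a small self-contained sketch of the information gain computation for a real-valued attribute split at a threshold. The threshold enumeration is our illustrative choice; standard decision tree learners perform the equivalent search internally.

import math

def info(p, n):
    """Information content I(p/(p+n), n/(p+n)) in bits."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            q = count / total
            result -= q * math.log2(q)
    return result

def gain(values, labels, threshold):
    """Information gain of the binary test value <= threshold.
    values: attribute values; labels: True for human-validated matches."""
    p = sum(labels)
    n = len(labels) - p
    remainder = 0.0
    for side in (True, False):
        subset = [l for v, l in zip(values, labels) if (v <= threshold) == side]
        if subset:
            pi = sum(subset)
            ni = len(subset) - pi
            remainder += (len(subset) / len(labels)) * info(pi, ni)
    return info(p, n) - remainder

def best_gain(values, labels):
    """Best achievable gain over all candidate thresholds."""
    return max(gain(values, labels, t) for t in set(values))

# Example: compare two attributes on the same labeled sample; the one with
# the higher best_gain is the better indicator of a function match.
sims = [0.02, 0.10, 0.45, 0.81, 0.64, 0.91]
labels = [False, False, False, True, True, True]
print(best_gain(sims, labels))  # 1.0: this attribute separates the classes perfectly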

5 Case Study

We performed a case study on Ctags using our weighted function relation analysis system. We chose Ctags because it is a well-known piece of software that is in wide use and is reasonably sophisticated. Moreover, it is written in the C programming language, so we could use SWAG Kit to extract, abstract and explore its software architecture. For the case study, we selected two releases of Exuberant Ctags: Ctags 4.0.1, released in June 2000, and Ctags 5.6.0, released in May 2006.

For validation purposes, we need to know which functions are indeed matching functions. However, we cannot go through each function pair one by one, as there are 216 unmatched functions in version 4 and 578 unmatched functions in version 5, which makes 124,848 combinations. Therefore, this process was done by first using Beagle to provide match candidates and then manually going through the source code of those functions to decide whether they really matched. We used the name matcher and the call relation matcher of Beagle in this case.

The name matcher calculates the longest common substring (LCS) of the name of the target entity against the names of each member of the candidate set, and normalizes the value by the average length of the two entity names (see the sketch below). We set the threshold to 0.1 so that we could get a larger set of candidates and not miss real matches. Out of the 39 match candidates suggested by Beagle, 5 were identified as true matches.

The call relation matcher returns a normalized value indicating how closely the caller/callee sets of two entities match, using uniform-weight function relation analysis. Out of the match candidates suggested by Beagle with threshold 0.1, 43 were identified as true matches.

Once the matches were identified, this information, along with the attributes calculated by the function analysis system, was fed into the machine learning program. The result is shown in the next section.
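A minimal sketch of that name similarity measure, assuming a straightforward dynamic-programming longest common substring (Beagle's actual implementation may differ):

def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    dp = [0] * (len(b) + 1)   # dp[j] = common suffix length of a[:i] and b[:j]
    for i in range(1, len(a) + 1):
        prev = 0
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = prev + 1 if a[i - 1] == b[j - 1] else 0
            best = max(best, dp[j])
            prev = cur
    return best

def name_similarity(name1, name2):
    """LCS length normalized by the average of the two name lengths."""
    avg_len = (len(name1) + len(name2)) / 2
    return longest_common_substring(name1, name2) / avg_len

# e.g. a renamed function usually keeps most of its name:
print(name_similarity("readTagFile", "readTagsFile"))  # about 0.61, well above 0.1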

6 Results

Using the experiment methodology, the information gain associated with each weighting function is listed in Figure 6 and Figure 7.

Figure 6: Information gain of the frequency based weighting functions versus the uniform weighting function (columns: Total Similarity, Caller Similarity, Callee Similarity; the best frequency based function reaches a maximum information gain of 0.03796).

Figure 7: Information gain of the hierarchy based weighting functions versus the uniform weighting function (same columns; the best hierarchy based function reaches a maximum information gain of 0.0379).

The results show that, with a properly chosen weighting function in either category, better performance can be achieved: the similarity values calculated by those weighting functions provide larger information gain than uniform weighting. We notice that the increase in information gain from using weighting functions over the uniform one is only around 5% in this case. After investigation, we are convinced that this is because the functions in Ctags are homogeneous. In Ctags, the number of incoming and outgoing calls of a function does not vary greatly, which means the frequency based weighting algorithms have little room for improvement. Also, all the functions in Ctags are in the same directory. This limits the difference any hierarchy based weighting function can make, and in turn restricts the potential improvement that can be achieved. If the functions in a subject system are heterogeneous in terms of call frequency and hierarchical call distance, the gain from using a weighting function should increase significantly.

7 Function Match Identification System

The ultimate goal in Origin Analysis is to be able to tell whether two functions or entities are similar enough to be considered the same. Traditional approaches use an attribute and a corresponding threshold as the test, and there is much research on finding good attributes and thresholds that work well in practice. Common approaches in this domain include name matching, parameter name matching, parameter type matching, function call relation matching, entity source code string matching, UML relationship matching [1] and many other methods. Research in this domain has typically focused on proving that a particular method and its associated attribute is a better measurement than existing ones and that the method can provide sufficient evidence on its own. Here, we propose an approach that can take advantage of all existing methods and is mathematically guaranteed to be no worse than any of them.

As discussed in the Experimental Methodology section, information theory can be used to find the best attribute out of a set of attributes. An approach that keeps using the best attribute available for testing, until no more attributes are available or required, can fully utilize the information contained in those attributes. The final result is mathematically more precise than using any single attribute alone, provided that none of these attributes is overly dominant, which is the case in Origin Analysis. The decision tree learning algorithm is sufficient for this purpose. In Figure 8, we demonstrate the procedure of using decision tree learning in function match identification.
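As a hedged illustration of this combination step (using scikit-learn, which is our choice here; the report's own implementation is not shown), training an entropy-based decision tree over several match attributes looks like this:

from sklearn.tree import DecisionTreeClassifier

# Each row holds the attributes of one candidate function pair, e.g.
# [overall similarity, caller similarity, callee similarity, name similarity].
X_train = [
    [0.02, 0.02, 0.00, 0.10],
    [0.51, 0.68, 0.00, 0.20],
    [0.65, 0.00, 0.81, 0.85],
    [0.45, 0.45, 0.00, 0.15],
    [0.72, 0.60, 0.79, 0.90],
]
y_train = [False, False, True, False, True]  # human-validated match labels

# criterion="entropy" makes the tree pick, at every node, the attribute
# with the highest information gain -- exactly the selection rule above.
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X_train, y_train)

# Predict whether an unseen candidate pair is an origin match.
print(tree.predict([[0.60, 0.55, 0.70, 0.80]]))  # e.g. [ True]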

We implemented a system capable of learning a decision tree and predicting results. However, we consciously chose not to test the precision of a decision tree generated from function call relation information alone. As the small numbers in the information gain tables indicate, the function call relation matcher cannot precisely predict matches between functions by itself, simply because it does not carry enough information. On the other hand, we believe that if other attributes are also used in the decision tree learning, the combined information is sufficient for making precise predictions.

Figure 8: The procedure for using the Origin Analysis system to identify entity matches and to evaluate prediction precision.

8 Conclusion and Future Work

In this project we have developed two sets of weighting functions and have designed and implemented an Origin Analysis system that uses them. Our hypothesis was that a well-designed weighted model of function relation analysis will outperform an unweighted one, since the former makes use of more information. We performed a case study on two versions of the Ctags system and showed that our results are better than the unweighted ones. Our system is capable of giving a quantitative comparison between various weighting functions, and it works with any variation of weighting functions or extra attributes. Another important advantage of our system is that it is completely automated and does not need any parameter picking or human input.

There are three directions in which future work can be carried out. One is to carry out more case studies. Another is to apply the experiment methodology to evaluate the performance of various existing or new approaches in Origin Analysis. The third is to incorporate existing or new approaches in Origin Analysis into the prediction framework and produce a system that can reliably predict entity matching by using all the information available.

More Case Studies. By increasing the sample size, the accuracy of the result will be improved. Using the Chernoff bound from ERM (Empirical Risk Minimization), the sample size needed to achieve < 0.05 error with > 95% confidence is 2248; right now, our training sample is of size 2768. The second reason to do more case studies is to increase the performance gain from using weighted function relation analysis. As discussed in the results section, Ctags offers little room for improvement due to its lack of subsystems and the fact that most of its functions share similar call frequencies. A larger and more sophisticated system would give the weighted approach more proper credit.

Apply the Experiment Methodology. Using the methodology discussed in Section 4, one direction future work can take is to explore other related areas and incorporate newer features to compare their performance. Prospective work here would evaluate the success of any algorithm or feature used in Origin Analysis. Examples include finding an objective quantitative measure of the value of UML relationships, proving whether LCS is a better indicator than the number of matching character pairs in name matching, and whether parameter name matching is better than parameter type matching.

Prediction System. The third direction is to incorporate the existing features used in Origin Analysis into the framework discussed in Section 7. Mathematically speaking, the result is bound to be better, but it is interesting to see how good the prediction can be in practice and to measure its performance.

References

[1] Z. Xing and E. Stroulia. UMLDiff: An algorithm for object-oriented design differencing. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE 2005), pages 54-65, 2005.

[2] S. Kim, K. Pan, and E. J. Whitehead Jr. When functions change their names: Automatic detection of origin relationships. In Proceedings of the 12th Working Conference on Reverse Engineering (WCRE 2005), pages 143-152, Pittsburgh, Pennsylvania, USA, 2005. IEEE Computer Society.

[3] Q. Tu and M. W. Godfrey. An integrated approach for studying architectural evolution. In Proceedings of the 10th International Workshop on Program Comprehension (IWPC 2002), pages 127-136, 2002.

[4] M. W. Godfrey and L. Zou. Using origin analysis to detect merging and splitting of source code entities. IEEE Transactions on Software Engineering, 31(2):166-181, 2005.

[5] D. L. Parnas. Software aging. In Proceedings of the 16th International Conference on Software Engineering (ICSE 1994), Sorrento, Italy, pages 279-287, May 1994.

[6] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach, pages 653-659.