Analyzing the software development process with SyQL and Lagrein


Mirco Bianco
Center for Applied Software Engineering
Free University of Bolzano
Via della Mostra, 4 I Bolzano-Bozen
Mirco.Bianco@unibz.it

Alberto Sillitti
Center for Applied Software Engineering
Free University of Bolzano
Via della Mostra, 4 I Bolzano-Bozen
Alberto.Sillitti@unibz.it

Giancarlo Succi
Center for Applied Software Engineering
Free University of Bolzano
Via della Mostra, 4 I Bolzano-Bozen
Giancarlo.Succi@unibz.it

Abstract

Mining information from software product metrics and software process data is very hard [14]. Data collected automatically by source code metrics extractors and by software development process probes come in different formats, which makes it difficult to use both at the same time. In this paper, we present a data manipulation language called System Query Language (SyQL), which overcomes problems of other similar languages and allows the user to access data stored in a relational-temporal database. Developers and managers can look at effort data and code metrics by writing very concise SQL-like queries and by using linguistic variables that are unavailable in other similar query languages. SyQL helps the user to access temporal data of the software process by providing a set of temporal constructs. Examples of problems solved using SyQL queries and Lagrein (a tool for source code analysis) are provided, showing the advantages of the proposed approach.

Keywords

Query languages, data warehouses, software metrics, effort, development process.

1. Introduction

Mining information from software process and product metrics at the same time is challenging [5]. The relations between them can differ depending on the analysis to be performed.
Typically, researchers mine information from relational data warehouses in an asynchronous way, using SQL to perform data extraction and other data manipulation tools (Weka, RapidMiner, Matlab, etc.) to perform further processing, such as filtering and clustering. These data warehouses grow by up to 1.5 GB/day [14]; therefore, the asynchronous approach is very time consuming. Moreover, the structure of the data warehouse is usually fairly complex [13]. To overcome these problems we propose a new language: System Query Language (SyQL). SyQL is a domain-specific language based on fuzzy-temporal logic that queries a data warehouse of software process data [15] through SQL-like queries. SyQL is used inside Lagrein [7] for retrieving and visualizing historical data about the software development process (code metrics, effort, bugs, etc.).

Software development takes place over time. To allow the user to consider the time aspect when evaluating software metrics, SyQL offers the possibility to filter data using temporal conditions.

The paper is organized as follows: section 2 defines the goals of this work, section 3 discusses the related work, section 4 presents our solution, section 5 gives an overview of our automatic metrics collection system, section 6 introduces the syntax of SyQL, section 7 describes how the SyQL query engine works, section 8 shows examples of visualization, and section 9 draws the conclusions and presents future directions.

2. The Goals

SyQL has been designed to achieve the following goals:
Build an abstraction layer between the user and the tables of the data warehouse;
Make the preparation of queries against the metrics data warehouse [14] trivial;
Help software engineers to evaluate software along the timeline;
Support the evaluation of the effort spent by the developers along the timeline;
Help the user to evaluate product quality using simple logic constructs;
Make the language extensible.
In summary, SyQL has a high aggregation capacity and supports extensible fuzzy logic and temporal functions. Fuzzy logic is useful for performing qualitative analysis on large datasets, which is sometimes more useful than quantitative analysis because the user cannot estimate the values of software metrics a priori [19]. The user can miss some important results if

he/she uses a wrong threshold value. Therefore, a fuzzy set encapsulates the experience needed to evaluate a particular metric. By temporal functions, we mean language clauses that help the user to write shorter queries. Temporal functions are needed because the software development process evolves over time, and so queries must include temporal conditions [1][16][18][10].

3. Related work

Several works address languages that can be used to query repositories of software data. The features that appear most relevant are: the capability to perform temporal queries on product and process metrics, the possibility to help the user filter the results through linguistic variables [17] (such as high, medium, low), and the possibility to be used in a general context. In addition, it is important to consider some other technical aspects, such as: support for combined analysis (software metrics/effort), temporal management, fuzzy logic support, supported programming languages (languages from which the tool is able to extract information for analysis tasks), and object orientation. In Table 1 we use these criteria to compare some of the most relevant existing work with SyQL.

Language Integrated Query (LINQ) [9] is designed to be embedded into another programming language. Therefore, queries can be performed with the same expressive power from a program written in C# 3.0, VB 9.0, or another .NET language. FuzzySQL [4] is a commercial relational database front-end; it supports fuzzy conditions and is designed to assist the user during analysis tasks. .QL [11] is a commercial tool designed to perform code analysis tasks such as reverse engineering and the discovery of bad code smells. dmFSQL [2] is a general-purpose, data-mining-oriented fuzzy query language implemented as an Oracle database front-end.
SCQL [6] is a domain-specific temporal query language used to retrieve information from a relational database containing information gathered from a source control system. NDepend is an application that uses CQL (Code Query Language) to extract information from .NET projects. With this program it is possible to extract a lot of information from the source code.

4. Our proposal

To enable the end user to perform fuzzy-temporal queries against a metrics data warehouse [15], we decided to implement a new query language, SyQL. The reasons for this choice are discussed below, showing the main differences from the languages introduced above.

The syntax of SyQL is similar to that of LINQ [9], but it is designed for different purposes. LINQ is more general and can perform queries on different data sources, while SyQL is tied to a specific data source (the metrics data warehouse [15]). Both are fully object oriented; SyQL allows the use of a fuzzy equal operator and temporal tokens, while LINQ does not.

The main difference between SyQL and FuzzySQL [4] is that FuzzySQL is a general-purpose relational database front-end, while SyQL is a specific tool for information retrieval tasks on the metrics data warehouse [14], with additional features to handle temporal analysis of the software development process.

SyQL can be used to perform both software metrics and effort analysis, whereas .QL [11] can handle only software data. SyQL can work on different projects written in different programming languages, while .QL can analyze only Java projects. SyQL supports fuzzy logic conditions; .QL does not.

SyQL is completely different from dmFSQL [2]: the purposes of the two languages are different, and the only evident similarity between them is the fuzzy logic support.

Both SCQL [6] and SyQL have keywords to manage temporal data.
The main difference is that SyQL is designed to be extended to handle different aspects of the development process (effort, software metrics, requirements, etc.), while SCQL is designed only to perform information retrieval tasks on software repository data. SCQL has no fuzzy logic support.

NDepend and SyQL have been designed to achieve different goals. With NDepend it is easier to keep a set of .NET projects under control, because it is highly integrated with the .NET environment; SyQL, on the other hand, is more platform independent (it also supports C/C++ and Java) and aims to help users control different aspects of the software development process. With SyQL it is possible to visualize and compare the values of a specific metric over a specified time interval (e.g., show the total number of lines of code in the last 6 months), whereas with NDepend it is only possible to compare two different versions of the code, showing the changes. SyQL makes it possible to run real effort analyses on source code (e.g., compute the total effort spent by the developers on a specific package/namespace), and it enables the user to track the bug fixing process by showing which methods were modified during a specific fixing task; with NDepend this is not possible.

Table 1: Comparison between different query languages.

Language | Support for combined analysis (software metrics/effort) | Temporal management | Fuzzy logic | General-purpose language | Supported programming languages (for analysis tasks) | Object orientation
LINQ [9] | NO | NO | NO | YES | None | YES
FuzzySQL [4] | NO | NO | YES | YES | None | NO
.QL [11] | NO | NO | NO | NO | Java | YES
dmFSQL [2] | NO | NO | YES | YES | None | NO
SCQL [6] | NO | YES | NO | NO | None | NO
NDepend | NO | NO | NO | NO | All .NET languages | YES
SyQL | YES | YES | YES | NO | C/C++, Java, C#, VB.NET | YES

5. Architecture description

Before presenting the architecture of SyQL and how the results are displayed, we give a brief introduction to our distributed, non-intrusive system for collecting software metrics [14]. Figure 1 shows the role of SyQL and Lagrein in the system. The metrics collection system is distributed: the application plug-ins are installed on the clients and trace the user activities inside the most common IDEs (Microsoft Visual Studio, Eclipse, etc.); the Source Code Analysis components run on a standalone machine that takes daily snapshots of the source code from the Versioning System. These components send the collected data to the Metrics Server using the Apache XML-RPC protocol implementation. Then, the Metrics Server organizes these data and stores them inside the relational data warehouse. The extracted information is delivered to managers and developers in two possible ways: either through an automatic, statically generated report (using Eclipse BIRT) or through Lagrein/SyQL in a "dynamic/visual" way.

Figure 1: The System Architecture.

6. Language description

We introduce the structure of the language through an example.
    [01] FROM Class c, Method m
    [02] WHERE c.getfullname() =
    [03] m.getdefclassfullname()
    [04] AND c.geteffort(YESTERDAY) IS High
    [05] SELECT c.getfullname(),
    [06] c.geteffort(TODAY - 1 'day'),
    [07] COUNT(m)
    [08] GROUP BY c.getfullname(),
    [09] c.geteffort(TODAY - 1 'day');

The above query returns a collection of class names, the related effort spent by the developers since yesterday, and the number of methods of each class. The first row introduces the FromClause, which can contain one or more FromElement(s). Each of them is composed of two literals: the former identifies the concept type, the latter declares the concept name (as in SQL). The second, third, and fourth rows introduce the WhereClause. In the example there are two conditions: an equality join condition and a fuzzy condition. The fuzzy condition evaluates the effort spent yesterday by the developers. The method c.geteffort(...) is a Java method that returns a value. The fifth, sixth, and seventh rows show the SelectClause. This is a non-empty collection of MethodCall(s) and/or aggregation functions (such as Count, Sum, Max, Min, etc.). In the last two rows we declare the GroupByClause, which is similar to the SQL one. As in other similar query languages [9][11], we decided to put the FromClause at the beginning of the query to allow the use of auto-completion in the Where, Select, and GroupBy clauses.
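The fuzzy condition `IS High` used in the query above can be illustrated with a small sketch. The paper does not specify the shape of the membership functions or how the engine turns a membership degree into a yes/no answer, so everything below is an assumption made for illustration: the ramp-shaped set `RampFuzzySet`, the helper `is_fuzzy`, the 0.5 cut, and the effort thresholds (in hours) are all hypothetical. The sketch is written in Python for brevity, although the SyQL engine itself is implemented in Java.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RampFuzzySet:
    """Hypothetical 'High' fuzzy set: degree 0 up to `low`, 1 from `high`,
    and linear in between. The real SyQL sets may have another shape."""
    low: float
    high: float

    def membership(self, value: float) -> float:
        if value <= self.low:
            return 0.0
        if value >= self.high:
            return 1.0
        return (value - self.low) / (self.high - self.low)


def is_fuzzy(value: float, fuzzy_set: RampFuzzySet, cut: float = 0.5) -> bool:
    """Evaluate `value IS <set>` as membership >= cut (assumed semantics)."""
    return fuzzy_set.membership(value) >= cut


# Illustrative thresholds only: effort measured in hours per day.
HIGH_EFFORT = RampFuzzySet(low=2.0, high=6.0)

# Made-up effort values standing in for c.geteffort(...) results.
efforts = {"ClassA": 1.0, "ClassB": 4.0, "ClassC": 7.5}
selected = [name for name, hours in efforts.items() if is_fuzzy(hours, HIGH_EFFORT)]
```

The point of the sketch is that the user writes `IS High` instead of choosing a crisp threshold: the thresholds live in the fuzzy set, which is where the experience needed to judge the metric is encoded.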


7. How the query engine works

7.1 Concepts and methods

Extensibility is one of the main requirements of the SyQL engine: the concepts (the non-terminal symbol FromElement) and methods (the non-terminal symbol MethodCall) used in a SyQL query are shipped in a separate library. This allows us to implement new concepts and new methods during the entire lifecycle of SyQL. Another advantage is that SyQL acts as an abstraction layer between the user and the data warehouse. Therefore, we can modify the schema of the data warehouse without affecting the user, provided that the library is updated properly.

Implementing a new concept in SyQL has only one requirement: an instance of a concept must be an entry of a relation defined with a SQL statement. In this way, we can perform the mapping between the SyQL concepts and the tables. The materialization of the object is performed through a constructor, which takes as input an entry of the relation defined above.

All the methods of a concept class that can appear in a SyQLExpression are annotated in one of two ways: an annotated method can become part of either an external or an internal calculable condition. A method can be annotated as external if the returned value is present in one column of the relation defining the concept; otherwise it must be annotated as internal. If a condition, which is represented by an instance of SyQLRelationalExpression, contains at least one internal calculable method, it must be evaluated inside the SyQL query engine; otherwise it can be evaluated by the query engine of the underlying DBMS. FuzzyExpression(s) are internal by default.

7.2 Query Execution

The SyQL query engine works on top of the DBMS (Figure 2).

Figure 2: The Data Layers.

The SyQL query engine has been implemented without the need to develop a sophisticated query planner and executor.
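The external/internal rule of Section 7.1 can be sketched as follows. This is a minimal illustration, not the engine's actual code: the classes `MethodCall` and `Condition`, the function `split_conditions`, and the choice of which methods are external are all hypothetical, and the annotation is modelled here as a plain boolean flag. The sketch is in Python, whereas the real concept libraries are Java classes.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class MethodCall:
    """A concept method used in a condition (hypothetical model).

    `external` is True when the returned value is a column of the relation
    that defines the concept, so the DBMS can evaluate it directly.
    """
    name: str
    external: bool


@dataclass(frozen=True)
class Condition:
    """A relational or fuzzy condition taken from the WhereClause."""
    text: str
    calls: Tuple[MethodCall, ...]
    fuzzy: bool = False

    def is_external(self) -> bool:
        # Fuzzy expressions are internal by default; otherwise one internal
        # method is enough to keep the whole condition inside the engine.
        if self.fuzzy:
            return False
        return all(call.external for call in self.calls)


def split_conditions(
    conditions: Sequence[Condition],
) -> Tuple[List[Condition], List[Condition]]:
    """Return (conditions for the DBMS, conditions for the SyQL engine)."""
    pushed_down = [c for c in conditions if c.is_external()]
    in_engine = [c for c in conditions if not c.is_external()]
    return pushed_down, in_engine


# The two conditions of the Section 6 example; treating the two name
# accessors as external is an assumption made for this sketch.
join = Condition(
    "c.getfullname() = m.getdefclassfullname()",
    (MethodCall("Class.getfullname", external=True),
     MethodCall("Method.getdefclassfullname", external=True)),
)
effort = Condition(
    "c.geteffort(...) IS High",
    (MethodCall("Class.geteffort", external=False),),
    fuzzy=True,
)
pushed_down, in_engine = split_conditions([join, effort])
```

Under these assumptions the join condition is handed to the DBMS, while the fuzzy effort condition stays in the SyQL engine.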
The idea is to push as many conditions as possible into the query engine of the underlying DBMS; in this way we obtain better time performance, because the SyQL query engine does not execute any join. To do this correctly, we convert the conditions that appear in the WhereClause into an equivalent Conjunctive Normal Form (CNF) formula using Boolean algebra and De Morgan's theorem. The CNF notation is very helpful, because a block of OR conditions can be processed by the underlying DBMS query engine only if all the conditions inside the block are evaluated as external; otherwise the block must be evaluated by the SyQL query engine. A condition is evaluated as external if all its predicates are external; otherwise it is evaluated internally. The query execution workflow is shown in Figure 3.

Figure 3: The SyQL Query Workflow.

To always be able to perform this conversion, we first convert the parsed formula into an equivalent Disjunctive Normal Form (DNF) formula. Then, we convert it into an equivalent CNF formula by computing the Cartesian product among all the conditions contained in the AND blocks. The most critical component for performance is the internal condition evaluator: internal conditions usually require a lot of computation, because most of them need to fetch data from the database. To address this problem we adopted two solutions: 1) sorting these conditions according to their cost, which is estimated by the developer of the SyQL libraries during the implementation; 2) evaluating these conditions in parallel, taking advantage of modern parallel/multicore hardware architectures.

8. Query visualization

SyQL queries may produce a large quantity of data. Extracting useful information from a large time series may be difficult for a human user. Inspecting a large software system (about 1,000 classes) over a period of one month (20 working days) generates about 20,000 values per selected class metric, assuming that we collect one metrics snapshot per day and do not specify any filtering condition. If we perform queries on methods instead of classes, the number of results grows even further. Computer animation can be a useful and intuitive solution for displaying evolving datasets [12]. We solve this problem by mapping the query results onto the metric views of Lagrein. Mapping these results is straightforward because the SyQL query engine is written in Java: the common implementation technology simplifies the integration between the two tools.

8.1 Introducing query visualization by examples

Example 1: In this example we visualize the growth of the classes (in terms of LOC) on which the developers have spent high effort during the last four days.

    FROM Class c, Chron chr
    WHERE chr.getdate() >= TODAY - 4 'days'
    AND chr.getdate() < TODAY
    AND c.geteffort(TODAY - 4 'days', TODAY) IS High
    SELECT c.getloc(chr.getdate());

The result of this query can be visualized either in an Evolution Matrix (Figure 4) or in an Evolution Chart. The result of the query above is a collection of ClassLOC instances. The ClassLOC class implements the interface ClassMetric.
Through this interface it is possible to retrieve the date, the class owner, and the value of the metric. In this way, it is possible to create an animated view of the growth of the classes in the last four days. It is also possible to repeat this query for all the software metrics collected by the source code analyzer (Cyclomatic Complexity, Halstead Volume, CK metrics [3]).

Figure 4: Evolution Matrix

Example 2: It is also possible to create static views of the system. In this example we select the classes with a high value of Coupling Between Objects (CBO).

    FROM Class c
    WHERE c.getcbo(TODAY) IS High
    SELECT c;

The result of this query (a static result) can be visualized in several views (Figure 5) available in Lagrein (e.g., Inheritance tree, Dependency graph, etc.).

9. Conclusion and future work

This paper discussed a possible approach for visualizing and mining software metrics and software process data. The whole architecture of the metrics collection system and the language structure of SyQL have been presented, and a comparison with existing systems has been provided, showing that SyQL goes further than the other existing languages. The query execution workflow has been discussed. As a proof of concept, a set of examples has been provided. We are currently using this language to build training datasets for estimating the fault-proneness of a method. We will embed these models into SyQL concept libraries, so that the language user will be able to estimate the fault-proneness of a method simply through a SyQL query.

Figure 5: Inheritance Tree of High CBO Classes

References

[1] M. Böhlen, J. Gamper, and C. Jensen. Multidimensional aggregation for temporal data. In Advances in Database Technology - EDBT 2006.
[2] R. Carrasco, M. Vila, and F. Araque. dmFSQL: A language for data mining. In Proceedings of the 17th International Conference on Database and Expert Systems Applications.
[3] S. Chidamber and C. Kemerer. A metrics suite for object oriented design. IEEE TSE, 20(6).
[4] E. Cox. FuzzySQL - a tool for finding the truth: the power of approximate database queries. PC AI, 14(1):48-51.
[5] F. Fioravanti and P. Nesi. Estimation and prediction metrics for adaptive maintenance effort of object-oriented systems. IEEE TSE, 27(12).
[6] A. Hindle and D. M. German. SCQL: a formal model and a query language for source control repositories. In Proceedings of the 2005 Workshop on Mining Software Repositories, pages 1-5.
[7] A. Jermakovics, R. Moser, A. Sillitti, and G. Succi. Visualizing software evolution with Lagrein. In OOPSLA Companion.
[8] M. Karaila and T. Systa. Applying template metaprogramming techniques for a domain-specific visual language - An industrial experience report. In Proceedings of the 29th International Conference on Software Engineering.
[9] E. Meijer, B. Beckman, and G. Bierman. LINQ: reconciling object, relations and XML in the .NET framework. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data.
[10] B. Moon and F.V. Lopez. Efficient algorithms for large-scale temporal aggregation. IEEE Transactions on Knowledge and Data Engineering, 15(3).
[11] O. de Moor, M. Verbaere, E. Hajiyev, P. Avgustinov, T. Ekman, N. Ongkingco, D. Sereni, and J. Tibble. Keynote Address: .QL for source code analysis. In Proceedings of the Seventh IEEE International Working Conference on Source Code Analysis and Manipulation, pages 3-16.
[12] M. Pinzger, H. Gall, M. Fischer, and M. Lanza. Visualizing multiple evolution metrics. In Proceedings of the 2005 ACM Symposium on Software Visualization, pages 67-75.
[13] K. Ramamurthy, A. Sen, and A.P. Sinha. Data Warehousing Infusion and Organizational Effectiveness. IEEE Transactions on Systems, Man and Cybernetics, Part A, 38(4).
[14] M. Scotto, A. Sillitti, G. Succi, and T. Vernazza. A non-invasive approach to product metrics collection. Journal of Systems Architecture, 52(11).
[15] M. Scotto, A. Sillitti, G. Succi, and T. Vernazza. Noninvasive collection of software metrics: some issues and experiences. In Sharing experiences on agile methodologies in open source software development, Polimetrica Publisher, Italy, pages 31-38.
[16] J. Yang and J. Widom. Incremental computation and maintenance of temporal aggregates. The VLDB Journal, 12(3).
[17] L. A. Zadeh. The Concept of a Linguistic Variable and its Application to Approximate Reasoning. Information Sciences, 8.
[18] D. Zhang, A. Markowetz, V. Tsotras, D. Gunopulos, and B. Seeger. Efficient computation of temporal aggregates with range predicates. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
[19] S. Zhang, J. Lu, and C. Zhang. A fuzzy logic based method to acquire user threshold of minimum-support for mining association rules. Information Sciences, 164(1-4):1-16, 2004.
