Using Information Retrieval to Support Software Evolution

Similar documents
Feature Location via Information Retrieval based Filtering of a Single Scenario Execution Trace

FLAT 3 : Feature Location & Textual Tracing Tool

Combining Probabilistic Ranking and Latent Semantic Indexing for Feature Identification

Integrated Impact Analysis for Managing Software Changes. Malcom Gethers, Bogdan Dit, Huzefa Kagdi, Denys Poshyvanyk

Feature Location via Information Retrieval based Filtering of a Single Scenario Execution Trace

An Efficient Approach for Requirement Traceability Integrated With Software Repository

IMPROVING FEATURE LOCATION BY COMBINING DYNAMIC ANALYSIS AND STATIC INFERENCES

An Approach for Mapping Features to Code Based on Static and Dynamic Analysis

Support for Static Concept Location with sv3d

Can Better Identifier Splitting Techniques Help Feature Location?

An Efficient Approach for Requirement Traceability Integrated With Software Repository

On the use of Relevance Feedback in IR-based Concept Location

A Heuristic-based Approach to Identify Concepts in Execution Traces

A Text Retrieval Approach to Recover Links among s and Source Code Classes

International Journal of Scientific & Engineering Research Volume 8, Issue 5, May ISSN

Fault Prediction OO Systems Using the Conceptual Cohesion of Classes

A Statement Level Bug Localization Technique using Statement Dependency Graph

TopicViewer: Evaluating Remodularizations Using Semantic Clustering

A Case Study on the Similarity Between Source Code and Bug Reports Vocabularies

Applying Semantic Analysis to Feature Execution Traces

Measuring the Semantic Similarity of Comments in Bug Reports

Mapping Bug Reports to Relevant Files and Automated Bug Assigning to the Developer Alphy Jose*, Aby Abahai T ABSTRACT I.

Recovering Traceability Links between Code and Documentation

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

Where Should the Bugs Be Fixed?

CORRELATING FEATURES AND CODE BY DYNAMIC

Mining Features from the Object-Oriented Source Code of a Collection of Software Variants Using Formal Concept Analysis and Latent Semantic Indexing

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction

Implementation of Measuring Conceptual Cohesion

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Recovering Documentation-to-Source-Code Traceability Links using Latent Semantic Indexing

Java Archives Search Engine Using Byte Code as Information Source

Non-negative Matrix Factorization for Multimodal Image Retrieval

Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic Algorithms

Advanced Matching Technique for Trustrace To Improve The Accuracy Of Requirement

Chapter 6: Information Retrieval and Web Search. An introduction

HOW AND WHEN TO FLATTEN JAVA CLASSES?

Towards a Taxonomy of Approaches for Mining of Source Code Repositories

Mining Software Repositories with Topic Models

International Journal for Management Science And Technology (IJMST)

Reconstructing Requirements Coverage Views from Design and Test using Traceability Recovery via LSI

An Object Oriented Runtime Complexity Metric based on Iterative Decision Points

Information Retrieval. hussein suleman uct cs

vector space retrieval many slides courtesy James Amherst

MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY

Information Retrieval

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Information Retrieval: Retrieval Models

Published in A R DIGITECH

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Distributed Case-based Reasoning for Fault Management

Text Analytics (Text Mining)

Configuring Topic Models for Software Engineering Tasks in TraceLab

A Survey of Feature Location Techniques

CS 6320 Natural Language Processing

A Topic Modeling Based Solution for Confirming Software Documentation Quality

Interactive Exploration of Semantic Clusters

Measuring Class Cohesion using Topic Modeling

Reading group on Ontologies and NLP:

Text Mining for Software Engineering

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Automatic Bug Assignment Using Information Extraction Methods

Concept Based Search Using LSI and Automatic Keyphrase Extraction

COMPARISON AND EVALUATION ON METRICS

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 11, May 2014

Information Retrieval & Text Mining

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Combining Information Retrieval and Relevance Feedback for Concept Location

Introduction to Information Retrieval

Text Mining: A Burgeoning technology for knowledge extraction

Non-negative Matrix Factorization for Multimodal Image Retrieval

5/8/09. Tools for Finding Concerns

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

What s a BA to do with Data? Discover and define standard data elements in business terms

Visualizing the evolution of software using softchange

Information Retrieval

Text Document Clustering Using DPM with Concept and Feature Analysis

Information Retrieval. (M&S Ch 15)

A New Context Based Indexing in Search Engines Using Binary Search Tree

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

SCOTCH: Test-to-Code Traceability using Slicing and Conceptual Coupling

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Enterprise Miner Software: Changes and Enhancements, Release 4.1

Distributed Information Retrieval using LSI. Markus Watzl and Rade Kutil

Why So Complicated? Simple Term Filtering and Weighting for Location-Based Bug Report Assignment Recommendation

SOFTWARE DEFECT PREDICTION USING PARTICIPATION OF NODES IN SOFTWARE COUPLING

Security Engineering for Software

Part I: Data Mining Foundations

Mining Web Data. Lijun Zhang

Eclipse Support for Using Eli and Teaching Programming Languages

Table of Contents 1 Introduction A Declarative Approach to Entity Resolution... 17

Text Analytics (Text Mining)

A Content Vector Model for Text Classification

Using Code Ownership to Improve IR-Based Traceability Link Recovery

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL

Transcription:

Using Information Retrieval to Support Software Evolution Denys Poshyvanyk Ph.D. Candidate SEVERE Group @

Software is Everywhere Software is pervading every aspect of life Software is difficult to make reliable and bugfree no physical limitations expressive and flexible, but complex invisible Software is difficult to design and maintain as much as 80% of 100 billion lines of code in production are unstructured, patched, badly documented [van Vliet 00] up to 90% of the total cost of software are spent on maintenance [Erlikh 00]

Addressed Problem The software developer has to maintain large software systems with: Little or no domain knowledge Absence of the original developer Badly organized, missing, or out of date documentation

Software Maintenance Requires Structural Information - the structural aspects of the source code (e.g., dependencies, architecture) Dynamic information behavioral aspects of the program (e.g., feature implementation overlap) Lexical Information - nature of the problem domain derived from names - identifiers, comments, documentation, and other artifacts Historical Information history of changes (cochanges) of different software artifacts (e.g., CVS logs rationale for changes)

Multitude of Software Artifacts Source code Call graphs Design docs readanddispatch -- org.eclipse.swt.widgets.display checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control Bug Reports Documentation Execution Traces

Research Focus Structural Information - the structural aspects of the source code (e.g., dependencies, architecture) Dynamic information behavioral aspects of the program (e.g., feature implementation overlap) Lexical Information - nature of the problem domain derived from names - identifiers, comments, documentation, and other artifacts Historical Information history of changes (cochanges) of different software artifacts (e.g., CVS logs rationale for changes)

Multitude of Software Artifacts Source code Call graphs Design docs readanddispatch -- org.eclipse.swt.widgets.display checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control Bug Reports Documentation Execution Traces

Combining Structural, Dynamic and Lexical Information Analysis Tool 1 Parser Structural Information Source Code and Documentation Analysis Tool N Tracer AST Call Graph Data Flow Control Flow System Representation IR Tool LSI Semantic Space Semantic Similarities

Extracting Lexical Information from Existing Software Systems Expensive large amounts of domain knowledge needs to be encoded Existing approaches use natural language processing [Shepherd 07] and knowledge based methods [Etzkorn 97] Our approach is to use less expensive methods

Latent Semantic Indexing (LSI) Vector space model based Information Retrieval method [Dumais 94, Berry 95, Deerwester 90] It has been used successfully as a method to represent aspects of the meaning of words and passages in natural language Ignores word ordering and semantically unimportant terms (e.g., the, and, for ), syntactic relations, and morphology It does not utilize a predefined grammar or vocabulary

Why Use LSI? Captures essential semantic info via dimensionality reduction Overcomes problems with polysemy and synonymy (Programming and natural) language independent Easy to apply on the source code

LSI in Software Engineering Abstract data type identification [Maletic 01] Traceability link recovery [Marcus 03] Concept location [Marcus 04] Software artifact management [DeLucia 04] Requirements traceability [Hayes 04] Conceptual cohesion of classes [Marcus 05] Software clustering [Kuhn 05] Conceptual coupling of classes [Poshyvanyk 06] Feature location [Poshyvanyk 07]

Extracting and Indexing Lexical Information with Information Retrieval Parsing source code and extracting documents corpus is a collection of documents (e.g., methods) Removing non-literals and stop words common words in English, standard function library names, programming language keywords Preprocessing: split_identifiers and SplitIdentifiers Indexing and retrieving semantic information with Latent Semantic Indexing

How Does LSI Work on Source Code? process flag monitor run 3 2 6 method1 x x x method2 x x x x x x Singular Value Decomposition (SVD): [nm] = [nk] [kk] [mk], k < n Similarity Measure: Cosine of the contained angle between the vectors Consider only k largest singular values for the LSI subspace

Currently Addressed Maintenance Tasks Feature (or concept) location in source code Impact analysis in source code Fault prediction in source code Traceability link recovery

Currently Addressed Maintenance Tasks Feature (or concept) location in source code Impact analysis in source code Fault prediction in source code Traceability link recovery

Concept Location in Source Code

Concept Location in Practice and Research Static Dependency based search [Rajlich 00] IR methods [Marcus 04] Dynamic Execution traces - Reconnaissance [Wilde 92] Scenario based probabilistic ranking [Antoniol 06] Combined Profiling with concept analysis [Eisenbarth 03] Feature dependencies [Salah 05] Feature evolution [Greevy 05] PROMESIR [Poshyvanyk 07]

Concept Location with Regular Expressions

Our Motivation

Concept Location with Information Retrieval [Marcus 04]

JIRiSS Google Eclipse Search IRiSS

Feature Location with Information Retrieval and Formal Concept Analysis [Poshyvanyk 07]

Feature Location with Software Reconnaissance Scenario NOT exercising the feature readanddispatch -- org.eclipse.swt.widgets.display checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control [Wilde 92] Scenario exercising the feature checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.progressbar WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control

Feature Location with Scenario-based Probabilistic Ranking Scenario NOT exercising the feature readanddispatch -- org.eclipse.swt.widgets.display checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control Scenario exercising the feature checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.progressbar WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control [Antoniol 06] WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control

Shortcomings Static analysis: Highly dependent on naming conventions and the developer s talent to write good queries Ignores other existing relationships between software components (such as, dependencies) May miss important parts of the source code Dynamic analysis: Execution traces are large even for small systems Selecting multiple scenarios may be difficult Filtering the traces is equally problematic best filtering methods still return hundreds of methods

Probabilistic Ranking Of MEthodS and Information Retrieval Scenario NOT exercising the feature readanddispatch -- org.eclipse.swt.widgets.display checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control Scenario exercising the feature checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control [Poshyvanyk 07]

Feature Location with PROMESIR Feature identification decision making problem in presence of uncertainty Static (LSI) and dynamic (SPR) methods: LSI queries static documents SBP analyzes dynamic traces of execution scenarios Complementary results are combined via affine transformation

Scenario-based Probabilistic Ranking Collecting Execution Traces Multiple (ir) relevant scenarios are executed to collect traces processor emulation (VALGRIND for C++) to improve the precision of data collection byte code instrumentation (JIKES for Java) Knowledge-based filtering to eliminate noisy events Probabilistic ranking events are re-weighted (Wilde s equation is renormalized)

Feature Location with PROMESIR

Example of using PROMESIR Locating a feature in JEdit Feature: showing white-space as a visible symbol in the text area Steps: Run two scenarios Run query Explore results

First Scenario Exercising the Feature in JEdit Start Tracing

First Scenario Exercising the Feature in JEdit Stop Tracing

Second Scenario NOT Exercising the Feature in JEdit Start Tracing

Example of using PROMESIR Results PROMESIR IR-based rankings Scenario-based probabilistic rankings Number of methods identified by SPR 284 The position of the first relevant method according to IR ranking 56 Position of the first relevant method according to PROMESIR - 7

Case Studies Locating features associated with bugs in: Mozilla (4,853 classes; 53,617 methods) Eclipse (7,648 classes; 89,341 methods) Case study objectives: Compare PROMESIR with stand-alone feature location approaches: LSI and SPR

Locating Features in Mozilla PROMESIR Bug # LSI SPR PROMESIR over LSI over SPR 182192 37 94 2 18 47 216154 76 332 6 12 55 225243 24 472 6 4 80 209430 49 509 1 49 509 231474 18 494 1 18 494

Locating Features in Eclipse PROMESIR Bug # LSI SPR PROMESIR over LSI over SPR 5138 7 268 1 7 259 31779 2 170 1 2 165 74149 5 456 3 2 156

SIngle Trace Information Retrieval (SITIR) Single Execution Trace Source Code readanddispatch -- org.eclipse.swt.widgets.display checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control readanddispatch -- org.eclipse.swt.widgets.display checkdevice -- org.eclipse.swt.widgets.display isdisposed -- org.eclipse.swt.graphics.device drawmenubars -- org.eclipse.swt.widgets.display runpopups -- org.eclipse.swt.widgets.display filtermessage -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control WM_TIMER -- org.eclipse.swt.widgets.control windowproc -- org.eclipse.swt.widgets.display windowproc -- org.eclipse.swt.widgets.control

Collecting Execution Traces in SITIR Java Platform Debugger Architecture (JPDA) 1 Infrastructure to build end-user debugging applications for Java platform JPDA highlights: Debugger works on a separate virtual machine Minimal interference of a tracing tool with a subject program Separate thread-based traces Marked traces (start/stop recording) 1 http://java.sun.com/javase/technologies/core/toolsapis/jpda/

Case Studies Locating features in JEdit Locating features associated with bugs in Eclipse Case study objectives: Compare SITIR with other approaches Study the impact of full and marked traces on the results Study the impact of user queries on the results

Multiple Developers and Queries for JEdit Features Feature Dev Query Search Show white space Add marker 1 search find phrase word text 2 search final all forward backward case sensitive 3 find search locate match indexof findnext 4 searchdialog find findbtn searchselection save searchfileset searchandreplace 1 red dot newline whitespace view show display tab 2 show hide whitespace blank space display 3 symbol replace changecolor setvisible addlayer whitespace loadsymbol 4 userinput textareapainter paint whitespace newline pnt 1 marker select word display text 2 add remove marker markers 3 select highlight mark change background 4 buffer addmarker marker selection

Results for Different Users and Queries for JEdit Features Full Marked Feature Dev LSI Trace Trace SITIR full SITIR marked Search Show white space Add marker 1 1477 202 61 11 2 1477 202 243 57 3 1477 202 32 13 4 1477 202 189 36 1 1462 284 956 152 2 1462 284 626 130 3 1462 284 497 104 4 1462 284 78 23 1 1478 304 26 5 2 1478 304 1 1 3 1478 304 3242 662 4 1478 304 20 5 6 20 6 11 30 48 16 8 1 1 160 4

Locating Features in Eclipse - Results Feature SINGLE TRACE SPR PROMESIR SITIR Select 721 268 1 2 Add files 740 170 1 2 Search 771 456 3 2

Summary for SITIR Uses only single scenario Less sensitive to scenario selection Unobtrusive tracing mechanism IR-based indexing of source code can be easily extended to other languages SITIR results are comparable or better than other feature location methods

Currently Addressed Maintenance Tasks Feature (or concept) location in source code Impact analysis in source code Fault prediction in source code Traceability link recovery

Impact Analysis in Software

Impact Analysis in Object-Oriented Systems Using Structural Coupling Measures [Briand 99] Coupling between classes (CBO) [Chidamber 04] Response for class (RFC) [Chidamber 04] Message passing coupling (MPC) [Li 93] Data abstraction coupling (DAC) [Li 93] Information-flow based coupling (IPC) [Lee 95]

Conceptual Coupling between Methods - Example Methods from MySecManager class in Mozilla Policy Wrapper Exception Security Error CanCreateWrapper 1 1 2 1 1 CanCreateInstance 1 0 2 1 1 CanGetAccess 0 0 0 0 1

Conceptual Coupling in Object- Oriented Systems Measures overlap of shared names in comments and identifiers Conceptual coupling between A and B = 0.56 Class A 0.5 0.7 0.6 method1 0.2 Class B method1 0.6 method2 0.7 0.3 0.4 method2 0.4 method3 0.3 0.4 0.2 method3

Evaluation Methodology For a given software system, a set of bug reports is mined from the bug tracking repository The set of classes, which are changed to fix bugs, is mined from concurrent version system For each changed class, conceptual and structural coupling metrics are computed and the other classes in the software system are ranked using these measures Select top classes in each ranked list of classes produced by every metric

Example The bug #232570 reports some problems associated with ldap2.server.position values for ab pane and search order in Mozilla A starting point, class nsabdirectoryquery is identified using a feature location technique We need to identify other classes that need to be modified (4,853 classes in Mozilla!)

Example Compute structural and conceptual coupling of nsabdirectoryquery and all the other classes in Mozilla Rank Conceptual coupling Information flow coupling Classes Values Classes Values 1 nsabqueryldapmessagelistener 0.86 nsdebug 123 2 nsabmdbdirectory 0.81 nsaflatstring 121 3 nsabdirectoryquerysimpleboolexpression 0.79 4 nsabldapdirectory 0.76 5 nsabview 0.72

Precision vs. Recall All classes in Mozilla (4,853) Retrieved Relevant Precision = RelRetrieved Retrieved Recall = RelRetrieved Rel in Collection

Precision and Recall: Example Total # of modified classes 9 Precision = 2/5 * 100% = 40% Recall = 2/9 * 100% = 22% Rank Conceptual coupling Information flow coupling Classes Values Classes Values 1 nsabqueryldapmessagelistener 0.86 nsdebug 123 2 nsabmdbdirectory 0.81 nsaflatstring 121 3 nsabdirectoryquerysimpleboolexpression 0.79 4 nsabldapdirectory 0.76 5 nsabview 0.72

Mining Bugs and Actual Changes Bugzilla for Mozilla contains 256,613 bug entries Bugs between version 1.6 and 1.7 (1,021 entries) Mined 391 bug entries: With approved patches Officially closed (resolved) Contained modifications to at least 2 classes Statistics: Average # of modified classes per bug report 7.3 Standard deviation 14.6 Outliers removed: bug report #226439, contained the record number of modified classes (149!)

90 80 70 60 50 40 30 20 10 0 Using Conceptual Coupling to Rank Classes in Mozilla Precision Recall 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.25 Threshold Recall and Precision

12 10 8 6 4 2 0 Using Coupling between Objects (CBO) to Rank Classes in Mozilla 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 Precision Recall Threshold Recall and Precision

Using Information-based Flow Coupling to Rank Classes in Mozilla 30 25 Precision Recall 20 15 10 5 0 300 280 260 240 220 200 180 160 140 120 100 80 60 40 20 Threshold Recall and Precision

Conclusions Combining lexical, structural and dynamic information shows promise in addressing a number of maintenance tasks Using LSI to retrieve lexical information proves to be efficient and fairly precise

Current and Future Work Investigate combinations of Information Retrieval and Natural Language Processing techniques to analyze lexical information in software artifacts Combing textual information with project history to support other maintenance tasks: Fault-prediction Traceability link recovery and maintenance

Related Publications Poshyvanyk, D., Guéhéneuc, Y. G., Marcus, A., Antoniol, G., Rajlich, V., Feature Identification using Probabilistic Ranking of Methods based on Execution Scenarios and Information Retrieval, IEEE Transactions on Software Engineering, vol. 33, no. 6, June 2007, pp. 420-432. Marcus, A., Poshyvanyk, D., and Ferenc, R., "Using the Conceptual Cohesion of Classes for Fault Prediction in Object Oriented Systems", IEEE Transactions on Software Engineering, to appear in 2008. Poshyvanyk, D., Marcus, A., Using Information Retrieval to Support Design of Incremental Change of Software, in the Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE2007), Atlanta, Georgia, November 5-9, 2007, pp. 563-566. Poshyvanyk, D., Marcus, A., Combining Formal Concept Analysis with Information Retrieval for Concept Location in Source code, in the Proceedings of the 15th IEEE International Conf. on Program Comprehension (ICPC2007), Banff, Alberta, Canada, June 26-29, 2007, pp. 37-48. Liu, D., Marcus, A., Poshyvanyk, D., Rajlich, V., Feature Location via Information Retrieval based Filtering of a Single Scenario Execution Trace, in the Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE2007), Atlanta, Georgia, November 5-9, 2007, pp. 234-243. Poshyvanyk, D., Marcus, A., "The Conceptual Coupling Metrics for Object- Oriented Systems", in the Proceedings of the 22nd IEEE International Conference on Software Maintenance (ICSM2006), Philadelphia, PA, September 25-27, 2006, pp. 469-478.

Contact Denys Poshyvanyk denys@wayne.edu www.cs.wayne.edu/~denys SEVERE Group @