Can Better Identifier Splitting Techniques Help Feature Location?

1 Can Better Identifier Splitting Techniques Help Feature Location?
Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol
SEMERU
19th IEEE International Conference on Program Comprehension (ICPC'11), Kingston, Ontario, Canada

5 Textual information embeds domain knowledge
About 70% of source code consists of identifiers*
Identifiers are an important source of information for maintenance tasks such as traceability link recovery and feature location
* Deissenboeck, F. and Pizka, M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp

6 [Figure: cumulative number of feature location papers based on textual information]

7 Related Work on Identifiers
Takang et al. (JPL'96): programs with full-word identifiers are more understandable than those with abbreviated ones
Lawrie et al. (ICPC'06): full words and recognizable abbreviations lead to better comprehension
Binkley et al. (ICPC'09): CamelCase style is easier to recognize than underscore style

8 Related Work on Identifiers
Enslen et al. (MSR'09): Samurai, an algorithm for splitting identifiers (using tables of identifier frequencies)
Guerrouj et al. (JSME'11): TIDIER, an algorithm for splitting identifiers (using contextual information)
Other related work: Deissenboeck and Pizka (SQJ'06), Antoniol et al. (ICSM'07), Haiduc and Marcus (ICPC'08), etc.

9 Splitting Identifiers Correctly is Challenging

10 Identifier Splitting Algorithms
Original identifiers: userId, setGID, print_file2device, SSLCertificate, MINstring, USERID, currentsize, readadapterobject, tolocale, imitating, DEFMASKBit

11 Identifier Splitting Algorithms

Original Identifier    Camel Case
userId                 user Id
setGID                 set GID
print_file2device      print file 2 device
SSLCertificate         SSL Certificate
MINstring              MI Nstring
USERID                 USERID
currentsize            currentsize
readadapterobject      readadapterobject
tolocale               tolocale
imitating              imitating
DEFMASKBit             DEFMASK Bit

12 Identifier Splitting Algorithms: Camel Case
Handles underscores and digits (print_file2device becomes print file 2 device)
Fails at mixed cases (MINstring becomes MI Nstring)
Fails at same-case identifiers (USERID, currentsize, and readadapterobject stay unsplit)
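The Camel Case behaviour listed above can be approximated with a single regular expression. This is a minimal sketch, not the splitter used in the study: the pattern and the function name `split_camel_case` are assumptions, chosen so that the output matches the Camel Case column of the slide, and the inputs assume the original mixed-case spellings (for example userId and setGID).

```python
import re

# Assumed pattern: an uppercase run that is followed by a capitalised word,
# an optionally capitalised lowercase word, a plain uppercase run, or digits.
# Underscores match nothing, so they act as separators.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_camel_case(identifier):
    """Split an identifier on underscores, digits, and case changes."""
    return _TOKEN.findall(identifier)


if __name__ == "__main__":
    for name in ["userId", "print_file2device", "MINstring", "USERID"]:
        print(name, "->", " ".join(split_camel_case(name)))
```

The sketch reproduces both failure modes named on the slide: MINstring is cut one letter too early, and same-case identifiers come back unchanged.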

14 Identifier Splitting Algorithms

Original Identifier    Camel Case             Samurai
userId                 user Id                user Id
setGID                 set GID                set GID
print_file2device      print file 2 device    print file 2 device
SSLCertificate         SSL Certificate        SSL Certificate
MINstring              MI Nstring             MIN string
USERID                 USERID                 USER ID
currentsize            currentsize            current size
readadapterobject      readadapterobject      read adapter object
tolocale               tolocale               tol ocal e
imitating              imitating              imi ta ting
DEFMASKBit             DEFMASK Bit            DEF MASK Bit

15 Identifier Splitting Algorithms: Samurai
Splits some cases where CamelCase cannot (MINstring becomes MIN string, USERID becomes USER ID, currentsize becomes current size, readadapterobject becomes read adapter object)
Oversplits (tolocale becomes tol ocal e, imitating becomes imi ta ting)
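To illustrate why frequency information helps with same-case identifiers, here is a small sketch of a dictionary-driven segmenter. It is not the Samurai algorithm, only a simplified stand-in that picks the most probable segmentation under a word-frequency table. The table `WORD_FREQ`, its counts, and the function name `split_same_case` are all hypothetical.

```python
import math

# Hypothetical word frequencies; a real table would be mined from source code.
WORD_FREQ = {"current": 120, "size": 300, "read": 250, "adapter": 40,
             "object": 200, "user": 400, "id": 350}


def split_same_case(token, freq=WORD_FREQ):
    """Return the most probable segmentation of a same-case token.

    Falls back to the unsplit token when it cannot be covered by known words.
    """
    text = token.lower()
    total = sum(freq.values())
    # best[i] holds (log-probability, words) for the best split of text[:i].
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is None or word not in freq:
                continue
            score = best[start][0] + math.log(freq[word] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1] if best[-1] else [token]


if __name__ == "__main__":
    for name in ["currentsize", "readadapterobject", "USERID", "zzz"]:
        print(name, "->", " ".join(split_same_case(name)))
```

A table that contains short or spurious entries would make a segmenter like this oversplit, which is the kind of error the slide attributes to Samurai.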

17 [Figure: cumulative number of feature location papers based on textual information]
Existing feature location techniques use Camel Case splitting

18-20 Information Retrieval FLT
Pipeline: generate corpus; preprocessing (remove non-literals, remove stop words, split identifiers, stemming); indexing (term-by-document matrix, Singular Value Decomposition); user formulates query; generate results; ranked list.
Example of preprocessing one method:
Source: synchronized void print(TestResult result, long runTime) throws IOException { printHeader(runTime); printErrors(result); printFailures(result); printFooter(result); }
Non-literals removed: synchronized void print TestResult result long runTime throws IOException printHeader runTime printErrors result printFailures result printFooter result
Stop words removed: print TestResult result runTime IOException printHeader runTime printErrors result printFailures result printFooter result
Identifiers split: print Test Result result run Time IO Exception print Header run Time print Errors result print Failures result print Footer result
Stemmed: print test result result run time io exception print head run time print error result print fail result print foot result
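The preprocessing steps of the pipeline (remove non-literals, remove stop words, split identifiers, stem) can be sketched as follows. Everything here is illustrative: the stop-word set is a hypothetical handful of Java keywords, the splitter is a regex approximation of Camel Case, and `naive_stem` only strips a plural "s", so it is a crude stand-in for a real stemmer and will not reduce header to head as the slide does.

```python
import re

# Hypothetical stop-word list (a few Java keywords); real lists are longer.
STOP_WORDS = {"synchronized", "void", "long", "throws", "public", "return"}

_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_camel_case(identifier):
    """Approximate Camel Case splitting (also splits on '_' and digits)."""
    return _TOKEN.findall(identifier)


def naive_stem(word):
    """Crude stand-in for a stemmer: strips a trailing plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(source_text):
    """Turn source text into a list of lowercase, stemmed terms."""
    # Step 1: remove non-literals by keeping identifier-like runs only.
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text)
    terms = []
    for identifier in identifiers:
        # Step 2: remove stop words.
        if identifier.lower() in STOP_WORDS:
            continue
        # Step 3: split identifiers. Step 4: lowercase and stem each part.
        for part in split_camel_case(identifier):
            terms.append(naive_stem(part.lower()))
    return terms


if __name__ == "__main__":
    code = ("synchronized void print(TestResult result, long runTime) "
            "throws IOException { printErrors(result); }")
    print(" ".join(preprocess(code)))
```

Swapping `split_camel_case` for another splitter is the only change needed to compare splitting strategies in such a pipeline.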

21 IR and Dynamic Information FLT
Pipeline: generate corpus; preprocessing (remove non-literals, remove stop words, split identifiers, stemming); indexing (term-by-document matrix, Singular Value Decomposition); user formulates query; collect execution trace; generate results; ranked list of executed methods.

22 Research Goal: evaluate how advanced splitting techniques impact the performance of feature location techniques

23 Information Retrieval FLT
Same pipeline (generate corpus, preprocessing, indexing, user query, results, ranked list), but in the "split identifiers" step replace Camel Case with: Samurai, or a perfect splitting algorithm (Oracle).
Ranked list (Better ... Worst)

24-29 Building the Oracle
Extract identifiers, giving the set of all identifiers.
Same split? (CamelCase, Samurai, TIDIER)
YES: concordant split identifiers. Assume they are correct; a sample was manually verified (threat to validity).
NO: discordant split identifiers. These were split manually, by consensus between the authors and after checking the source code, giving the manually split identifiers.
Identifiers that could not be split were left unchanged. Examples: DT, i3, P754, zzz, etc.
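The first step of building the Oracle, separating concordant from discordant identifiers, can be sketched as below. The function `partition_identifiers` and the two toy splitters are hypothetical stand-ins; in the study the three splitters were CamelCase, Samurai, and TIDIER, and the discordant identifiers went to manual splitting.

```python
import re


def partition_identifiers(identifiers, splitters):
    """Separate identifiers on which all splitters agree from the rest.

    Returns (concordant, discordant): a dict mapping each agreed identifier
    to its split, and a list of identifiers that need manual splitting.
    """
    concordant, discordant = {}, []
    for identifier in identifiers:
        splits = {tuple(split(identifier)) for split in splitters}
        if len(splits) == 1:
            concordant[identifier] = list(splits.pop())
        else:
            discordant.append(identifier)
    return concordant, discordant


# Toy splitters used only to exercise the function (not the real algorithms).
def toy_underscore(identifier):
    return identifier.split("_")


def toy_letters_digits(identifier):
    return re.findall(r"[A-Za-z]+|[0-9]+", identifier)


if __name__ == "__main__":
    agreed, to_review = partition_identifiers(
        ["print_file", "file2device", "zzz"],
        [toy_underscore, toy_letters_digits],
    )
    print("concordant:", agreed)
    print("discordant:", to_review)
```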

31 Design of the Case Study
RQ: Does an FLT with an advanced splitting algorithm produce better results than the same FLT using the CamelCase splitting algorithm?

33 How to Compare two FLTs?
Effectiveness measure for each feature.
IR: all methods are ranked by LSI score. In the slide's example list of nine methods, the gold set method gives Effectiveness = 5.

34 How to Compare two FLTs?
IRDyn: only the executed methods (from the trace) are ranked by LSI score. In the same example, four executed methods remain and Effectiveness = 2.
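The two numbers in this example can be reproduced with a short sketch. It assumes that effectiveness is the position of the first gold set method in the ranked list (lower is better), and that IRDyn simply drops methods absent from the execution trace before ranking. The method names, gold set, and trace below are made up so that the results match the slide's 5 and 2.

```python
def effectiveness(ranked_methods, gold_set):
    """Return the 1-based rank of the first gold set method, or None."""
    for rank, method in enumerate(ranked_methods, start=1):
        if method in gold_set:
            return rank
    return None


def filter_by_trace(ranked_methods, executed_methods):
    """Keep only executed methods, preserving the original ranking order."""
    return [m for m in ranked_methods if m in executed_methods]


if __name__ == "__main__":
    # Hypothetical ranked list, gold set, and execution trace.
    ranked = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9"]
    gold = {"m5"}
    trace = {"m2", "m5", "m8", "m9"}

    print("IR effectiveness:", effectiveness(ranked, gold))
    print("IRDyn effectiveness:",
          effectiveness(filter_by_trace(ranked, trace), gold))
```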

35 Which FLTs are we Comparing?

36 Software Systems
Rhino 1.6R5: 138 classes, 1,870 methods, 32K LOC. Eaddy et al.'s data*. 2 datasets:
Rhino Features: size 241; queries: sections of ECMAScript documentation; gold sets: Eaddy et al.*; execution information: full execution traces (from unit tests)
Rhino Bugs: size 143; queries: bug title and description; gold sets: Eaddy et al.* (CVS); execution information: N/A

37 Software Systems
jEdit classes, 6.4K methods, 109K LOC. 2 datasets:
jEdit Features: size 64; queries: feature (or patch) title and description; gold sets: SVN; execution information: marked execution traces
jEdit Bugs: size 86; queries: bug title and description; gold sets: SVN; execution information: marked execution traces
Datasets available at:

38 Generating the jEdit Datasets: SVN commits between v4.2 and v4.3

39 Generating the jEdit Datasets: SVN commit message. Title + Description = Query

40 Generating the jEdit Datasets: for the changed files, compare the previous version and the current version using the Eclipse AST. The modified methods form the gold set.

42 Presenting the Results
Box plots of all effectiveness measures in each dataset (e.g., 241 data points for Rhino Features), with the average and the median marked

43-46 IR FLTs
[Box plots, one panel per dataset: Rhino Features, Rhino Bugs, jEdit Features, jEdit Bugs]
Similar median and average
Datasets with features have better results than datasets with bugs

47-48 IRDyn FLTs
[Box plots, one panel per dataset: Rhino Features, Rhino Bugs, jEdit Features, jEdit Bugs; N/A where no execution information is available]
Similar median and average
Datasets with features have better results than datasets with bugs

49-50 Compare FLTs by Percentages
IR Oracle vs. IR CamelCase (Baseline) /8 2/8
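A percentage comparison of two FLTs can be sketched as counting, feature by feature, how often one technique ranks the gold set method higher than the baseline. The function name and the eight paired values below are hypothetical; the sketch assumes that a lower effectiveness value is better.

```python
def compare_by_percentages(baseline, treatment):
    """Percentage of features where `treatment` is better, equal, or worse.

    Both inputs are paired effectiveness values; lower means better.
    """
    if len(baseline) != len(treatment) or not baseline:
        raise ValueError("expected two non-empty lists of equal length")
    n = len(baseline)
    better = sum(t < b for b, t in zip(baseline, treatment))
    worse = sum(t > b for b, t in zip(baseline, treatment))
    same = n - better - worse
    return {
        "better": 100.0 * better / n,
        "same": 100.0 * same / n,
        "worse": 100.0 * worse / n,
    }


if __name__ == "__main__":
    # Made-up effectiveness values for eight features.
    ir_camel_case = [5, 3, 7, 2, 9, 4, 6, 8]
    ir_oracle = [4, 3, 9, 1, 9, 2, 6, 10]
    print(compare_by_percentages(ir_camel_case, ir_oracle))
```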

51-52 IR
[Percentage charts, one panel per dataset: Rhino Features, Rhino Bugs, jEdit Features, jEdit Bugs]
Datasets with features vs. datasets with bugs

53 IRDyn
[Percentage charts, one panel per dataset: Rhino Features, Rhino Bugs, jEdit Features, jEdit Bugs; N/A where no execution information is available]
Similar trend

54 Statistical Results
Wilcoxon signed-rank test
Null hypothesis: there is no statistically significant difference in terms of effectiveness between IR Samurai / IR Oracle and IR CamelCase
Alternative hypothesis: IR Samurai / IR Oracle has statistically significantly higher effectiveness than IR CamelCase
alpha =
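For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a one-sided Wilcoxon signed-rank test on paired effectiveness values. It uses the normal approximation without tie or continuity corrections, so its p-values are rough, and it is not the tooling used in the study. The function name and the sample data are hypothetical, and "higher effectiveness" is taken to mean a lower rank value.

```python
import math


def wilcoxon_signed_rank(baseline, treatment):
    """One-sided test that `treatment` values are lower than `baseline`.

    Returns (w_plus, p_value), where w_plus is the sum of ranks of the
    positive differences baseline - treatment.
    """
    diffs = [b - t for b, t in zip(baseline, treatment) if b != t]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return w_plus, p_value


if __name__ == "__main__":
    # Made-up data in which the treatment always ranks the gold method first.
    ir_camel_case = [10, 9, 8, 7, 6, 5, 4, 3]
    ir_oracle = [1, 1, 1, 1, 1, 1, 1, 1]
    print(wilcoxon_signed_rank(ir_camel_case, ir_oracle))
```

With small samples an exact test is preferable to the normal approximation used here.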

55 IR
[Panels: Rhino Features, Rhino Bugs, jEdit Features, jEdit Bugs]
The only statistically significant result (p=0.05)

56 Qualitative Results
Vocabulary mismatch between queries and code:
Names of developers (e.g., Slava, Carlos)
Identifiers specific to communication (e.g., thanks, greetings, annoying)

58 Qualitative Results
Features are more descriptive than bugs
Words "join" and "line" are not mentioned

59 Threats to Validity
External: 2 Java applications (different domains); more systems needed
Construct: errors may be present in the Oracle and gold sets; we used data produced by other researchers
Internal: subjectivity and bias in building the Oracle
Conclusion: non-parametric test (Wilcoxon signed-rank)

60 Research Questions
RQ1: Does IR Samurai outperform IR CamelCase in terms of effectiveness? NO
RQ2: Does IR Samurai Dyn outperform IR CamelCase Dyn in terms of effectiveness? NO
RQ3: Does IR Oracle outperform IR CamelCase in terms of effectiveness? In some cases (Rhino)
RQ4: Does IR Oracle Dyn outperform IR CamelCase Dyn in terms of effectiveness? NO

61 Future Work
More systems and datasets
Different maintenance tasks, such as traceability link recovery
Consider other splitting algorithms

62 Conclusions
Advanced splitting techniques could improve FLTs; we found some empirical evidence
Splitting has more impact on IR FLT
If execution information is available, it is not necessary to use an advanced splitting technique

63 Thank you! Questions?
William and Mary, SEMERU
bdit@cs.wm.edu

64 References
Takang et al. (1996): Takang, A., Grubb, P., and Macredie, R., "The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Investigation", Journal of Programming Languages, vol. 4, no. 3, 1996, pp
Lawrie et al. (2006): Lawrie, D., Morrell, C., Feild, H., and Binkley, D., "What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June , pp
Binkley et al. (2009): Binkley, D., Davis, M., Lawrie, D., and Morrell, C., "To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May , pp
Enslen et al. (2009): Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K., "Mining Source Code to Automatically Split Identifiers for Software Analysis", in Proc. of IEEE MSR'09, May , pp
Guerrouj et al. (2011): Guerrouj, L., Di Penta, M., Antoniol, G., and Guéhéneuc, Y.-G., "TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques", JSME, vol. to appear,
