Renovating an Open Source Project: RECODER


School of Mathematics and Systems Engineering
Reports from MSI - Rapporter från MSI

Renovating an Open Source Project: RECODER

Óscar León Fernández

June 2009
MSI Report
Växjö University
ISSN
SE VÄXJÖ
ISRN VXU/MSI/DA/E/ /--SE

Abstract

Renovating a program is always hard work and full of problems. It requires good knowledge of the old program as well as research and investigation. RECODER is a tool for supporting static metaprogramming of Java program sources. This tool has been patched over and over again as new versions of Java came out. Because of that, the code has become dirty and full of patches. In order to clean up the implementation, the code should be renovated. In this thesis we present the changes that were introduced to the grammar and the implementation of the new parser for RECODER using a different technology.

Key-words: AST, EBNF, Java, RECODER

Contents

1. Introduction
   1.1 Motivation
   1.2 Goal Criteria
   1.3 Overview of the report
2. Background
   2.1 BNF and EBNF
   2.2 Description of RECODER
   2.3 Parser Generators for Java
       2.3.1 JavaCC
       2.3.2 ANTLR
       2.3.3 SableCC
       2.3.4 CookCC
       2.3.5 CUP
3. Analysis of the JavaCC Parser Specification in the current RECODER version
   3.1 Analysis
   3.2 Conclusion
4. Description of the new implementation
   4.1 Analysis
   4.2 Conclusion
5. Evaluation
   5.1 Compatibility
   5.2 Maintainability
6. Conclusion & Future Work
   6.1 Summary
   6.2 Conclusion
   6.3 Future Work
References


1 Introduction

Reverse engineering has been used for renovating programs since the beginning of the modern computer. Nowadays, new technologies are constantly emerging that can improve performance or make a given task easier. But changing from an old technology to a new one is not easy. Several problems arise when you want to renovate a program: for example, the functionality of the program has to stay the same while the internal structure changes. From the point of view of the end user, no change in the functionality of the program should be visible. In terms of maintainability, the program has to be easy to correct and easy to read in order to make future maintenance easier.

Understanding the program is vital for renovating it. The majority of software development time is spent on the maintenance of old programs and not on the development of new software. To gain knowledge of the old programs, engineers need reliable information about the system, but this is not always available. Sometimes the documentation of a program is imprecise or incomplete, and sometimes the program is not documented at all. Also, the architecture described in a project does not always match what is actually programmed in the source code. That is why the most reliable source of information is the code itself, although it is also the most difficult to analyze. Hence we need tools that analyze the source code and return enough information for understanding the existing software system.

Reverse engineering is a process for obtaining the information necessary to renovate the code. With it we can recover the design specification of a system from its implementation. The recovered design specification helps us to understand the program before restructuring the code. The information can also be used as feedback for a new requirements specification or to help with maintenance tasks.

1.1 Motivation

RECODER has many parts that are deprecated or need deep remodelling. Some of these parts are based on old technologies, are badly documented, or consist of dark code (code that is a real mess or very confusing). The main problem that we confront here is renovating the front-end of the open source program RECODER; to be more precise, we have to renovate RECODER's parser for Java code. The parser is one of the most important parts of RECODER.

RECODER was created when Java was at version 1.2, and the grammar specification was originally written for that version. As new versions of Java were released, the grammar was patched and changed in places in order to follow the new specifications. In particular, the last versions, 1.5 and 1.6, introduced a lot of changes in the specification. These changes were introduced into RECODER's grammar with many patches and hacks. All these patches finally made the grammar dark, messy and hard to understand. For this reason a new grammar following the new specification is needed, in order to clarify the grammar and improve its readability and maintainability. It is also an opportunity to change the parser generator and introduce a new technology.

The current parser for Java source code is based on a JavaCC grammar, and we transform and update this grammar into a new one based on ANTLR.

1.2 Goal Criteria

This master thesis is focused on renovating the source code front-end of the tool RECODER. The first goal is to replace the current grammar, which was upgraded step by step to version 1.6, with a new specification written directly for the new version, in order to make it clearer. The new grammar should use ANTLR instead of JavaCC. The Abstract Syntax Tree (AST) created by the new grammar should be the same as the one created by the old grammar, so that nothing changes in the final result when a program is parsed.

Good functionality of the new parser based on ANTLR is desirable. This can be checked with the test suite that comes with the last released version of RECODER. The new parser should pass all the tests if possible and, if not, an explanation of the error and a possible solution should be given. The new parser therefore needs to be fully compatible with the old one, even though they are based on different technologies, so that the end user does not notice any change in functionality.

The new code has to be maintainable. The source code (in this case the grammar) has to be clear for the next developer who wants to deal with the new implementation for future changes. The new grammar has to be similar to the current specification in order to maintain compatibility. In summary, the work should:

- Replace the old grammar, based on Java 1.2 and patched, with a new one based directly on Java 1.6.
- Change the technology from JavaCC to ANTLR.
- Create the same AST as the old version.
- Pass all the tests in the test suite (if not, explain why they fail).

1.3 Overview of the report

This report is structured as follows. In Section 2 we give background information that is necessary to understand the problem and the parser tools that can be used. In Section 3 we describe the current grammar specification and point out some weak parts of the grammar. In Section 4 we describe our new implementation, explaining the differences between the original ANTLR grammar, the final version and the JavaCC grammar. In Section 5 we discuss the results that we have obtained. Finally, in Section 6, we give a conclusion and possible future work.

2 Background

In this section we talk about the notations used to represent grammars. We give an overview of RECODER, the framework that we want to renovate, and a brief explanation of some parser generators for Java. The most important ones for us are JavaCC, which is used in the current version of the parser in RECODER, and ANTLR, which is used in the new implementation of the parser.

2.1 BNF and EBNF

Backus Normal Form (BNF) is a formal, mathematical way to describe a language. A language is described with a set of terminals (tokens), a set of non-terminals, a start symbol that belongs to the set of non-terminals, and the rules or productions. The Extended Backus Normal Form (EBNF) is, as its name says, an addition to the normal notation that allows lists and optional symbols to be expressed in a very easy way. Three new symbols are added to the notation. The asterisk or star * means repeat the preceding symbol zero or more times; this operator is also called the Kleene star or Kleene closure. The plus + means repeat one or more times. The last operator added is the optional operator, represented with a question mark ?, which means that the element appears zero or one time. A small illustrative example of these operators is shown below.

2.2 Description of RECODER

RECODER is a tool for supporting static metaprogramming of Java program sources (the manipulation of the input program is called static when it is done at compile time, and dynamic when it is done at runtime). RECODER parses and analyzes Java programs and also performs transformations on the sources, saving the results in new files. A metaprogram is a program that takes another program as input data and manipulates that input in some way.

Figure 2.1: How RECODER works [1]

RECODER derives a metamodel of all the entities found in Java source files and class files. The model contains a detailed syntactic representation that can be unparsed to a file again.
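Returning for a moment to the EBNF operators of Section 2.1, the following small rules illustrate how they are used. The rule names are only illustrative; they are not taken from RECODER's grammar.

ParameterList : Parameter ("," Parameter)*        // zero or more further parameters after the first one
Digits        : (Digit)+                          // one or more digits
Dimension     : "[" (Expression)? "]"             // the size expression may be omitted

Each of these forms appears many times in the grammars discussed in the rest of this report.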

Figure 2.1 shows how program sources are parsed and analyzed. RECODER derives a metamodel of the program with all the entities found in the sources. Then, with a metaprogram as input, the code is transformed, possibly many times, and unparsed to a new source code file. RECODER reads all the source code completely, assuming that the input program compiles correctly. All transformations are applied to the source code; the bytecode is read only at declaration level. In other words, the bytecode is read only in order to know which methods and fields belong to a given class.

2.3 Parser Generators for Java

There are many parser generators for Java, but some are used more than others. Here we give an overview of some parser generators for Java. JavaCC is probably the most used at the moment. ANTLR is a good parser generator whose number of users is rising. SableCC, CookCC and CUP are other parser generators that are in use [2].

A parser generator is a tool that creates a parser from a formal description of the language, normally expressed as a BNF or EBNF grammar. The grammar describes the language that is to be parsed. Normally, when a program is parsed, a syntax tree is created. This syntax tree can be reduced by eliminating the symbols that carry no information. These reduced trees are called Abstract Syntax Trees, and the parser generators normally provide facilities to create them. In between the symbols of the rules, actions can be added that allow us to insert code for calculating attributes or values. The abstract syntax tree of RECODER is created inside these actions and not by using the tree-building facilities of the parser generators.

2.3.1 JavaCC

JavaCC [3] (Java Compiler Compiler) is an open source parser generator for the Java programming language (BSD license). The JavaCC grammar is LL(k) and can be written in EBNF notation. JavaCC generates top-down parsers (the tree is generated from the root to the leaves) and it does not allow left recursion, because an LL(k) parser applied to a left-recursive rule cannot decide which alternative has to be taken. JavaCC also provides other standard capabilities related to parser generation, such as tree building and actions.

2.3.2 ANTLR

ANTLR [4] (ANother Tool for Language Recognition) is a parser generator that automates the construction of language recognizers. It is possible to add actions with code snippets in different programming languages, and tree building is also supported. ANTLR can generate code for different target languages, although it generates Java code by default. The generated code is human-readable and easy to fold into other applications. The generated parser is a recursive descent recognizer using LL(*), an extension of LL(k) that uses arbitrary lookahead to make decisions depending on the rule. The code is under the BSD license. ANTLR supports multiple target languages such as Java, C#, Python, Ruby, Objective-C, C, and C++.
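To give a flavour of how such embedded actions look, the following fragment sketches a small ANTLR rule with a Java action. The rule is only illustrative and is not taken from RECODER's grammar; it assumes an IDENTIFIER token is defined in the lexer.

greeting returns [String result]
    :   'hello' id=IDENTIFIER
        { result = "Hello, " + $id.text + "!"; } // embedded Java action, run after the preceding tokens are matched
    ;

RECODER's new parser uses exactly this mechanism: the actions call RECODER's factory to build AST nodes, instead of relying on ANTLR's own tree-building facilities.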

2.3.3 SableCC

SableCC [5] is a parser generator that generates object-oriented frameworks for building compilers. It can generate strictly typed ASTs, and tree walkers are included. SableCC has a clean separation between the generated code and the code written by the user.

2.3.4 CookCC

CookCC [6] is a parser generator written in Java, but the target code can vary. It uses templates to generate source code, so it is easy to add a new target language. It also comes with a suite of test cases to assist in creating and testing new target languages. A unique feature of CookCC is that it allows the lexer and parser to be specified using Java annotations. This feature simplifies and eases the writing of lexers and parsers for Java.

2.3.5 CUP

CUP [7] (Constructor of Useful Parsers) is a system which generates LALR parsers from simple grammar specifications. It works similarly to the well known parser generator YACC [8] and offers most of its features. However, CUP is written in Java, uses specifications including embedded Java code, and produces parsers which are implemented in Java.
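One practical consequence of the choice between these tools is how recursion in the grammar has to be written. Since JavaCC and the ANTLR version used here generate top-down (LL) parsers, left-recursive rules cannot be used directly and must be rewritten, typically with EBNF repetition. A generic example, not taken from either grammar:

// Left recursive: rejected by LL parser generators such as JavaCC and ANTLR
expression : expression '+' term
           | term
           ;

// Equivalent form using EBNF repetition
expression : term ('+' term)*
           ;

LALR tools such as CUP, in contrast, accept the left-recursive form directly.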

3 Analysis of the JavaCC Parser Specification in the current RECODER version

In this section we discuss the current parser specification in depth, including some patches that show up in the grammar and code that has been commented out. The first subsection contains the complete analysis of the grammar, followed by a conclusion with the ideas that come out of the analysis.

The current grammar from which the parser is generated has been patched and updated every time the Java specification has changed. The patches that have been added have made the grammar messy and unreadable. The grammar is written for an old version of the parser generator JavaCC. Because of that, not only is the grammar deprecated, but some functionality that it uses is not necessary anymore.

3.1 Analysis

At the beginning of the grammar we can already see that it has been patched over and over again, as the following comment shows [Listing 3.1]. The comment talks about one patch adding semicolons, which is shown later in this section [Listing 3.18].

/** JavaCC AST generation specification based on the original Java1.1 grammar
    that comes with javacc, and includes the modification of D. Williams to
    accept the Java 1.2 strictfp modifier. Several patches have been added to
    allow semicolon after member declarations. */

Listing 3.1: The code has been patched, first by adding semicolons

Inside the body of the JavaCC parser there are some variables and methods that are no longer useful. These methods are related to the use of the .super predicate inside an explicit constructor invocation and are shown in the next code [Listing 3.2].

boolean superAllowed = true;

private boolean isSuperAllowed() {
    return superAllowed;
}

private void setAllowSuper(boolean b) {
    superAllowed = b;
}

Listing 3.2: Code that is no longer used

Some code exists to maintain downward compatibility with old versions of Java. In the old versions of Java there is no assert, and this word can be a normal identifier. To handle this, the following variables and methods are used to keep track of whether assert is a keyword. RECODER must be able to parse old programs written for old versions of Java as well as new code written for the new versions. This code was added when the new versions of Java were released [Listing 3.3].

boolean jdk1_4 = false;
boolean jdk1_5 = false;

public boolean isAwareOfAssert() {
    return jdk1_4;
}

public void setAwareOfAssert(boolean yes) {
    jdk1_4 = yes;
    if (yes == false)
        jdk1_5 = false;
}

public boolean isJava5() {
    return jdk1_5;
}

public void setJava5(boolean yes) {
    jdk1_5 = yes;
    if (yes)
        jdk1_4 = true;
}

Listing 3.3: Downward compatibility

There are also some methods, specific to JavaCC, for managing the position of the tokens in the code. By setting the size of a tabulation, JavaCC can report the correct offset within a line. These methods are used to set and get the number of white spaces that are equivalent to one tabulation. In some code bases the tabulation is fixed to a specific number of white spaces; sometimes it is eight and sometimes four, which is why we need to be able to fix the size of a tabulation [Listing 3.4].

public void setTabSize(int tabSize) {
    jj_input_stream.setTabSize(tabSize);
}

public int getTabSize() {
    return jj_input_stream.getTabSize(0); // whatever...
}

Listing 3.4: Code specific to JavaCC

There are some global variables in RECODER that are used inside the specification. These variables could be replaced by local variables, which would make the code easier to understand. When an array is declared with an unknown number of dimensions, the variable tmpDimension is used in the declarations to count the number of dimensions. This variable is incremented every time a dimension is found while reading the source [Listing 3.5].

/** temporary valid variable that is used to return an additional argument
    from parser method VariableDeclaratorId, since such an id may have a
    dimension */
private int tmpDimension;

Listing 3.5: Temporary variable that could be removed

Other support code has been erased because it is no longer necessary or because it has been adapted to the new ANTLR parser generator [Listing 3.6]. The first function, copyPrefixInfo, copies the relative position, start position and end position of an element of the AST. The function shiftToken maintains an iterator called current. This iterator stops at the token just before the actual one. Sometimes the token before it is a special token, as is the case for comments; in that case the token before is the special token. From the position of the actual token and the token before it, the relative position is calculated so that it can be set later on.

private void copyPrefixInfo(SourceElement oldResult, SourceElement newResult) {
    newResult.setRelativePosition(oldResult.getRelativePosition());
    newResult.setStartPosition(oldResult.getStartPosition());
    newResult.setEndPosition(oldResult.getEndPosition());
}

private void shiftToken() {
    if (current != token) {
        if (current != null) {
            while (current.next != token) {
                current = current.next;
            }
        }
        Token prev;
        if (token.specialToken != null) {
            prev = token.specialToken;
        } else {
            prev = current;
        }
        if (prev != null) {
            int col = token.beginColumn - 1;
            int lf = token.beginLine - prev.endLine;
            if (lf <= 0) {
                col -= prev.endColumn; // - 1
                if (col < 0) {
                    col = 0;
                }
            }
            position.setPosition(lf, col);
        }
        current = token;
    }
}

Listing 3.6: Functions unused in the new implementation

In the grammar specification, the token specification for the scanner appears first, followed by the rules for the parser. Three tokens deserve special mention because they show how the downward compatibility discussed before works [Listing 3.3]. We can see that if the Java version is old, the kind of the token is changed to IDENTIFIER. We can also see that something still remains to be done when @ is found [Listing 3.7].

< ASSERT: "assert" > { if (!myParser.jdk1_4) matchedToken.kind = IDENTIFIER; }
< ENUM: "enum" >     { if (!myParser.jdk1_5) matchedToken.kind = IDENTIFIER; }
< AT: "@" >          { if (!myParser.jdk1_5) { /* TODO */ } }

Listing 3.7: Code for downward compatibility

A program in Java can be contained in one or more compilation units. The first thing that can appear in a compilation unit is a package name, but it is not required. Then zero or more import declarations can appear, and after that some types can be declared. As we can see, the compilation unit rule has two alternatives: one with a package declaration and one without. This duplication could be avoided using the EBNF optional operator ? (which JavaCC writes with brackets [ ]), reducing the code substantially [Listing 3.8]. We can also see at the beginning one of the quick fixes done in the grammar.

CompilationUnit CompilationUnit() :
    // This is a quick "fix" - TypeDeclaration and PackageDeclaration unfortunately
    // can both start with an unlimited number of annotations. However, usually only one file
    // per package contains package annotations, so this is not a performance issue.
    ( LOOKAHEAD(PackageDeclaration())
      PackageDeclaration() (ImportDeclaration())* (TypeDeclaration())*
    | (ImportDeclaration())* (TypeDeclaration())*
    )

Listing 3.8: Compilation Unit in the grammar
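A sketch of how the rule could look with the optional operator instead of the duplicated alternatives; this is only an illustration of the simplification suggested above, not the rule as it stands in RECODER:

CompilationUnit CompilationUnit() :
    [ LOOKAHEAD(PackageDeclaration()) PackageDeclaration() ]
    (ImportDeclaration())*
    (TypeDeclaration())*

The lookahead is kept because, as the comment in Listing 3.8 explains, both a package declaration and a type declaration can start with annotations, but the duplicated (ImportDeclaration())* (TypeDeclaration())* tail disappears.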

The possible types are the following: classes, interfaces, enumerations and annotations [9] [Listing 3.9]. In Listing 3.9 we can also see code that can be removed, as the comment itself says. Next we can see the type declaration with its lookahead; this is not necessary in the new parser because it is LL(*): it will choose the best branch, consuming as many tokens as necessary to decide.

TypeDeclaration TypeDeclaration() :
    ( LOOKAHEAD( ( "abstract" | "final" | "public" | "strictfp" | AnnotationUse() )* "class" )
      result = ClassDeclaration()
      ...
    | ";"
    )
    {
        if (result != null) // may be removed as soon as Recoder fully understands Java5
            setPostfixInfo(result);
        return result;
    }

Listing 3.9: Type declarations with deprecated code at the end

In some parts of the grammar we can see code that has been commented out because the grammar has changed. In the following code the declaration of a constant inside an AnnotationTypeDeclaration is commented out [Listing 3.10].

// ConstantDeclaration
/*LOOKAHEAD( (AnnotationUse() | "public" | "static" | "final")* Type() VariableDeclarator(true))
  (AnnotationUse() | "public" | "static" | "final")* Type() VariableDeclarator(true)
  ( "," VariableDeclarator(true))* ";"*/

Listing 3.10: Commented code that has not been removed

In method declarations, when the method has a generic type, the current grammar contains a hack. This hack is not needed: the first token that appears in the rule can be taken in order to set the prefix. The hack is used in more places that use generic types and can simply be deleted. It will not be shown again, but it appears several times, for example in the constructor declaration [Listing 3.11].

15 [ "<" if (ml.size() == 0) // '<' of MethodDeclaration is first element then. Need to store the result somewhere... dummy = factory.createpublic() setprefixinfo(dummy) /* HACK */ typeparams = TypeParametersNoLE() ] Listing 3.11: Generic parameters in a method declaration In the same rule, methoddeclaration, we find old code commented out that it is not necessary anymore and can be deleted [Listing 3.12]. th.setexceptions(trl) // Throws th = factory.createthrows(trl) result.setthrown(th) Listing 3.12: More out commented code The code it is completely full of out commented code as we can see and we found a lot of patches and code from the past commented, that do the code more unreadable and dirty. Another example of out commented code it is found inside of the rule formalparameter and we can see how is used the variable tmpdimension that is incremented inside of the rule variabledeclaratorid [Listing 3.13]. id = VariableDeclaratorId() dim = tmpdimension /*if (varargspec!= null) dim++*/ Listing 3.13: Another piece of out commented code There is full rule that has been commented out probably to improve the performance with or because the changes in the grammar when the new versions of Java where released [Listing 3.14]. /* ASTList<UncollatedReferenceQualifier> NameList() : ASTList<UncollatedReferenceQualifier> result = new ASTArrayList<UncollatedReferenceQualifier>() UncollatedReferenceQualifier qn qn = Name() result.add(qn) 11

16 ( "," qn = Name() result.add(qn) )* return result */ Listing 3.14: Entire rule commented out Inside of an expression we can see this comment that talk about one expansion for performance reasons and but also says that is a weakness of the grammar that should be solved [Listing 3.15]. /* * This expansion has been written this way instead of: * Assignment() ConditionalExpression() * for performance reasons. * However, it is a weakening of the grammar for it allows the LHS of * assignments to be any conditional expression whereas it can only be * a primary expression. Consider adding a semantic predicate to work * around this. */ Listing 3.15: Comment showing a weakness of the grammar We found more strange comments in the enumconstant rule. In this code is set the position of the start position and the end position of an EnumConstructorReference, and as we can see this can be done before in the grammar [Listing 3.16]. ref = factory.createenumconstructorreference(args, cd) setprefixinfo(ref) // TODO this maybe too late?! setpostfixinfo(ref) spec = factory.createenumconstantspecification(id, ref) setprefixinfo(spec) // TODO this maybe too late?! Listing 3.16: Suspicious comments that inform us of a possible change At the end of the class declaration we also found a comment that gives us a clue that something is not well done in that place. This comment is after set the end position of a class declaration that should be a brace [Listing 3.17]. result.setmembers(mdl) setpostfixinfo(result) // coordinate of ""?! return result Listing 3.17: Another suspicious comment 12

In the following production of the grammar we can see the patch mentioned before: at the end of a field declaration a semicolon is always allowed, and it is marked with comments showing the patches. This patch appears not only in the members of a class; it is also repeated in the declarations of all the members of an interface, which can be a static block, a nested declaration or a method declaration [Listing 3.18].

MemberDeclaration ClassBodyDeclaration() :
    ( ...
    | (FieldDeclaration() (";")*) // patch
    )

Listing 3.18: Semicolon patch after a member declaration

We have found more commented-out code in VariableDeclaratorId; once again it is about setting the position [Listing 3.19].

Identifier VariableDeclaratorId() :
    ...
    setPostfixInfo(result);
    //setPrefixInfo(result);
    return result;

Listing 3.19: Commented code in a rule

In the shift expressions, code has been commented out to improve the performance of the JavaCC parser with some predicates [Listing 3.20]. This code is useless in the new implementation because it is not needed to recognize the token: instead of the token >> a new rule called RSIGNEDSHIFT is used, and instead of >>> a rule called RUNSIGNEDSHIFT.

Expression ShiftExpression() :
    AdditiveExpression()
    ( ( "<<"
      // | ">>"
      | RSIGNEDSHIFT()
      // | ">>>"
      | RUNSIGNEDSHIFT()
      )
      AdditiveExpression()
    )*

Listing 3.20: Code that is useless in the new implementation

Some parts of the grammar can be restructured in order to avoid unnecessary error checking after parsing. An example of this appears in the try-catch-finally statement. After a try block, at least one catch block or one finally block has to appear. In the current grammar this fact has to be checked in the semantic analysis [Listing 3.21].

/*
 * Semantic check required here to make sure that at least one
 * finally/catch is present.
 */
TryStatement : "try" BLOCK ( "catch" "(" FORMALPARAMETER ")" BLOCK )* ( "finally" BLOCK )?

Listing 3.21: Current try statement specification in EBNF

This can be solved very easily syntactically, and it is not necessary to do it semantically [Listing 3.22].

TryStatement : "try" BLOCK
    ( ( "catch" "(" FormalParameter() ")" BLOCK )+ ( "finally" BLOCK )?
    | "finally" BLOCK
    )

Listing 3.22: Hypothetical solution to the problem shown before

We found another hack, related to the one shown before [Listing 3.11], that has been added to insert the correct position of a method declaration [Listing 3.23]. This rule separates the symbol < from the rest of the type parameters of a generic type. That is done to manage the start position of a method declaration: depending on whether the number of modifiers is zero, the first element can be the less-than symbol.

// HACK for handling position of methoddeclarations correctly
ASTList<TypeParameterDeclaration> TypeParametersNoLE() :
{
    ASTList<TypeParameterDeclaration> res = new ASTArrayList<TypeParameterDeclaration>();
    TypeParameterDeclaration tp;
}
{
    tp = TypeParameter() { res.add(tp); }
    ("," tp = TypeParameter() { res.add(tp); })*
    ">"
    { return res; }
}

Listing 3.23: Hack for handling the position of method declarations

In PrimaryExpression we can also see a lot of code that should be changed, and suspicious comments [Listing 3.24].

The comments show that something needs to be changed. The first comment, with its many exclamation marks, probably points out that we are returning the result before the end of the function. The other comments talk about the types that should be at that position.

Expression PrimaryExpression() :
    ...
    return result; //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ...
    // the prefix MUST be a type expression!!!!!
    // we currently only create UncollatedReferenceQualifiers
    ...
    // should be a FieldReference?
    ...

Listing 3.24: Strange comments in primary expression

Inside the production PrimaryPrefix we found a rule that has been commented out, probably because the grammar was changed [Listing 3.25].

// LOOKAHEAD(NonWildcardTypeArguments() "this" Arguments())
// NonWildcardTypeArguments() "this" /* Arguments() is a mandatory suffix here!*/
// {
//     prefix.type = prefix.this
//     System.err.println("Ignoring NonWildcardTypeArguments!")
// }
//

Listing 3.25: Commented rule in PrimaryPrefix

Inside the production StatementExpression we can see evidence of another expansion. This production generates more than the legal Java for this kind of statement [Listing 3.26].

/*
 * The last expansion of this production accepts more than the legal
 * Java expansions for StatementExpression. This expansion does not
 * use PostfixExpression for performance reasons.
 */

Listing 3.26: Comment that shows us an expansion

Inside the rule ForStatement we can see commented-out code. This code was changed when the enhanced for loop (for each) was introduced in the newer versions of Java. Now the size of the lookahead in the rule has been changed in order to predict the alternative that has to be taken [Listing 3.27].

// {
//     result = factory.createFor();
//     setPrefixInfo(result);
// }
// "("

Listing 3.27: Commented-out code inside the for loop

3.2 Conclusion

In this section we have talked about the RECODER grammar. We have discovered a lot of old commented-out code that needs to be cleaned up. We have also found comments that point out possible errors in the grammar. Some rules can be simplified, which improves the grammar a bit. To conclude this section, we have seen why it is necessary to change the grammar and clean up the code.

4 Description of the new implementation

The new version of the parser for RECODER is based on a different parser generator called ANTLR. The syntax of this parser generator is similar to JavaCC but differs in some details. The grammar is written in EBNF and accepts actions embedded in the rules. In JavaCC, optional parts of the grammar are expressed with square brackets instead of the usual notation that uses ?; ANTLR is more readable with the operator ?, which is the standard in this notation. The original grammar on which we base our implementation is taken from the official ANTLR webpage, which contains many grammars for different languages and purposes. The original grammar can be downloaded for free from this link [10].

4.1 Analysis

First of all, we talk about the changes introduced in the original grammar from the ANTLR webpage to adapt it to our use with RECODER. Other changes have been introduced in order to clean up some rules or because they are not useful. The first change was made in order to simplify the grammar. In the original grammar an import declaration is described as below [Listing 4.1].

importdeclaration
    :   'import' ('static')? IDENTIFIER '.' '*' ';'
    |   'import' ('static')? IDENTIFIER ('.' IDENTIFIER)+ ('.' '*')? ';'
    ;

Listing 4.1: Original import declaration in the ANTLR grammar

As we can see, we can merge the two alternatives by changing the + into an asterisk and deleting the first alternative. The result is shown below [Listing 4.2]; the rule has been reduced by five lines and is clearer to read.

importdeclaration
    :   'import' ('static')? IDENTIFIER ('.' IDENTIFIER)* ('.' '*')? ';'
    ;

Listing 4.2: Import declaration simplified
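Both versions of the rule accept the same import declarations. For example, all of the following ordinary, on-demand and static imports (sample input only, not grammar rules) are matched by the merged rule of Listing 4.2:

import java.util.List;
import java.util.*;
import static java.lang.Math.max;
import static java.lang.Math.*;

The first alternative of the original rule, an identifier directly followed by .*, is simply the case where the ('.' IDENTIFIER)* loop of the merged rule is taken zero times, as in import java.*;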

The next change to the original grammar is deleting a rule that is unused. After checking that the rule is never used and is not useful to us, we decided to delete it. Judging by its name, the rule was supposed to be used in the import declaration above, but probably somebody changed the grammar and forgot to erase it [Listing 4.3].

qualifiedimportname
    :   IDENTIFIER ('.' IDENTIFIER)*
    ;

Listing 4.3: Rule deleted because it is not used

When a modifier appears (for example in a class declaration), the original grammar uses the following production [Listing 4.4]. In order to make the grammar similar to the JavaCC grammar, the asterisk has been taken out of this production and written outside of it, in every rule that uses modifiers. This change does not modify the language; it is just written in a different way.

modifiers
    :   ( annotation | 'public' | 'strictfp' )   //* -> The Kleene closure is now outside of this rule
    ;

Listing 4.4: Rule modified for similarity with the JavaCC grammar

We have changed the explicit constructor invocation in order to make it equal to the JavaCC grammar. The grammar was a bit different, but this does not change the final result. We can see the original production here [Listing 4.5].

explicitconstructorinvocation
    :   (nonwildcardtypearguments)?
        //NOTE: the position of Identifier 'super' is set to the type args position here
        ('this' | 'super') arguments ';'
    |   primary '.' (nonwildcardtypearguments)? 'super' arguments ';'
    ;

Listing 4.5: Original explicit constructor invocation

We can see that the language it expresses is the same; it is just a reordering of the grammar [Listing 4.6].

explicitconstructorinvocation returns [SpecialConstructorReference result]
    :   (nonwildcardtypearguments)? 'this' arguments ';'
    |   (expr = primarynosuper '.')? (nonwildcardtypearguments)? 'super' arguments ';'
    ;

Listing 4.6: Explicit constructor invocation in the final grammar

Normally, in a block statement you can declare local variables and classes, and you can also insert normal statements (including new blocks). In this sense the original grammar is less restrictive than the JavaCC grammar, because it allows interfaces to be declared inside a block, which is not allowed [Listing 4.7].

blockstatement
    :   localvariabledeclarationstatement
    |   classorinterfacedeclaration
    |   statement
    ;

Listing 4.7: Original block statement in the grammar

The grammar has been changed so that the second alternative uses a normal class declaration instead of a class or interface declaration, in order to stay closer to the JavaCC specification and save a semantic check [Listing 4.8].

blockstatement returns [Statement result]
    :   localvariabledeclarationstatement
    |   normalclassdeclaration
    |   statement
    ;

Listing 4.8: Changed block statement

We can now see how the problem that we pointed out before [Listing 3.21] is solved, in a way very similar to the possible solution we presented [Listing 3.22]: either a catch block or a finally block always appears [Listing 4.9].

trystatement
    :   'try' block
        ( catches 'finally' block
        | catches
        | 'finally' block
        )
    ;

Listing 4.9: Solution of the problem in the try statement

But we have more to say about the try statement. The original grammar introduced catches as a list of catch clauses [Listing 4.10], but that is not so convenient for the translation from the current JavaCC grammar to the new one, so some changes were introduced [Listing 4.11].

trystatement
    :   'try' block
        ( catches 'finally' block
        | catches
        | 'finally' block
        )
    ;

catches
    :   catchclause (catchclause)*
    ;

catchclause
    :   'catch' '(' formalparameter ')' block
    ;

Listing 4.10: Original try-catch-finally specification

Instead of this, the production catches was deleted, and the catch clause was renamed to catch block. The rule formalparameter was also replaced by normalparameterdecl, which contains the same content; because of that, formalparameter can be deleted, since it is not used anywhere else in the grammar. The resulting grammar looks like this [Listing 4.11].

trystatement
    :   'try' block
        ( catchblock+ 'finally' block
        | catchblock+
        | 'finally' block
        )
    ;

catchblock
    :   'catch' '(' normalparameterdecl ')' block
    ;

Listing 4.11: Modified try-catch block

There is one little change in the enhanced for loop statement in order to be closer to the JavaCC specification [Listing 4.12].

forstatement
    :   // enhanced for loop
        'for' '(' variablemodifiers type IDENTIFIER ':' expression ')' statement
    |   // normal for loop
        'for' '(' (forinit)? ';' (expression)? ';' (expressionlist)? ')' statement
    ;

Listing 4.12: Little change in the for loop

Instead of the variable modifiers, the type and the identifier, the rule now uses the for loop initialization, the same as in the normal for loop.

The biggest change made in the grammar concerns the primary expressions. The primary expressions in the ANTLR grammar were difficult to translate from the JavaCC grammar, so the approach was to take the JavaCC grammar and translate this part into the ANTLR syntax. After that we have to be sure that the meaning of the grammar does not change: it has to generate the same AST as the grammar before. In order to understand the equivalence between both versions, note that in the original ANTLR grammar the production that contains primary is a primary expression followed by an arbitrary number of selectors (primary selector*). In the final specification the structure is primaryprefix selector*.

primary
    :   parexpression
    |   'this' ('.' IDENTIFIER)* (identifiersuffix)?
    |   IDENTIFIER ('.' IDENTIFIER)* (identifiersuffix)?
    |   'super' supersuffix
    |   literal
    |   creator
    |   primitivetype ('[' ']')* '.' 'class'
    |   'void' '.' 'class'
    ;

primaryprefix
    :   parexpression
    |   'this'
    |   IDENTIFIER ('.' IDENTIFIER)* ( ('[' ']')* '.' 'class')?
    |   'super' '.' IDENTIFIER
    |   literal
    |   creator
    |   primitivetype ('[' ']')* '.' 'class'
    |   'void' '.' 'class'
    ;

Listing 4.13: Comparison between primary prefixes before and after the change (primary is the original ANTLR rule, primaryprefix the final one)

The listing above tries to match up the differences between the original ANTLR grammar and the final result: the original rule is shown first and the final one second [Listing 4.13]. At the beginning of the rules we can see that there are similarities. The parenthesized expression is still the same; however, the alternative that starts with 'this' has something more after it in the original, and the same happens at the end of the identifier alternative and the super alternative. The rest is the same as before. But as we can see, we have to introduce more changes in order to leave the language unaltered. The original grammar is difficult to follow because it mixes the prefix with the suffix in a primary expression. The identifier and super suffixes are attached inside the primary rule, so it is easier and clearer to move them into selector.

It is also better for the later translation because it is closer to the JavaCC specification.

supersuffix
    :   arguments
    |   '.' (nonwildcardtypearguments)? IDENTIFIER (arguments)?
    ;

identifiersuffix
    :   ('[' ']')+ '.' 'class'
    |   ('[' expression ']')+
    |   arguments
    |   '.' 'class'
    |   '.' nonwildcardtypearguments IDENTIFIER arguments
    |   '.' 'this'
    |   '.' 'super' arguments
    |   innercreator
    ;

selector
    :   '.' IDENTIFIER (arguments)?
    |   '.' 'this'
    |   '.' 'super' supersuffix
    |   innercreator
    |   '[' expression ']'
    ;

selector
    :   selectornosuper
    |   '.' 'super' supersuffix
    ;

selectornosuper
    :   '.' IDENTIFIER (arguments)?
    |   '.' 'this'
    |   '.' creator
    |   '[' expression ']'
    |   '.' nonwildcardtypearguments IDENTIFIER
    ;

Listing 4.13: Comparison between primary suffixes before and after the change; the first three rules belong to the original grammar, the last two to the final grammar

The original suffix rules (the first group above) are a bit difficult to understand. Since selector can be repeated as many times as we want, repeating the selector rule of the final grammar generates the same constructs as the original rules. It is difficult to see that both grammars generate the same language, but they are equivalent. So innercreator, identifiersuffix and supersuffix are moved around in the grammar without any effect on the final result.

We can also see that in the final grammar there is a production selectornosuper. This is a change introduced to control the use of the .super predicate, which before was controlled by a boolean variable [Listing 3.2]. A new rule selectornosuper is introduced, without the .super postfix, together with a primarynosuper rule that repeats this selector instead of the normal one when we are inside explicitconstructorinvocation.

Continuing with the changes to the original grammar, the production classcreatorrest, shown below in Listing 4.14, has been replaced by its content in creator.

creator
    :   'new' nonwildcardtypearguments classorinterfacetype classcreatorrest
    |   'new' classorinterfacetype classcreatorrest
    |   arraycreator
    ;

classcreatorrest
    :   arguments (classbody)?
    ;

Listing 4.14: Eliminated rule classcreatorrest

Finally, we can add some general simplifications to the grammar. When we find something like A (A)*, we can change it to (A)+. Sometimes it is better to leave the grammar as it is, but sometimes we can save space and simplify the code a bit.

In the next paragraphs of this section we discuss the differences between the current grammar used with JavaCC and the new one for ANTLR, with the changes explained earlier in this section. Comparing the compilation unit in the new ANTLR grammar [Listing 4.15] with the compilation unit in the JavaCC grammar, we can see that they are very close to each other. The only difference between the two is that the annotations are outside of the package declaration and that the two alternatives are merged into one, as we explained above for Listing 3.8.

compilationunit returns [CompilationUnit result]
    :   ((annotations)? packagedeclaration)? (importdeclaration)* (typedeclaration)*
    ;

Listing 4.15: Compilation unit in the new grammar

Apart from the annotations inside the package declaration, the rules are the same, and so are the import declarations. In the type declarations we can see the first big difference between the two grammars: they are just organized in another way, with different rules. In the JavaCC grammar the type declaration is directly a class, an interface, an enum or an annotation. The ANTLR grammar is a bit more complex: a type can be a class declaration or an interface declaration. As we said before, an enum is a special kind of class, so under the class declaration we have the normal class declaration and the enum declaration. The same happens for the interface declaration, where we have the normal interface declaration and the annotation type [Listing 4.16].

typedeclaration returns [TypeDeclaration result]
    :   res = classorinterfacedeclaration { result = res; }
    |   ';'
    ;

classorinterfacedeclaration returns [TypeDeclaration result]
    :   classdeclaration
    |   interfacedeclaration
    ;

classdeclaration returns [TypeDeclaration result]
    :   normalclassdeclaration
    |   enumdeclaration
    ;

interfacedeclaration returns [TypeDeclaration result]
    :   normalinterfacedeclaration
    |   annotationtypedeclaration
    ;

Listing 4.16: Type declarations in the ANTLR grammar

Between the current annotation-type declaration and the new one there are a few differences. The new grammar separates the body of the annotation type into its own rule, and the different members inside it into a new rule as well. In this way everything is more ordered, but maybe it is less efficient in terms of space and time. With the enumerations the same thing happens as with the annotations: the rule is almost equal to the current grammar, but it has one more intermediate rule for the body and another for the list of constants. The classes have the same structure, if we ignore that in the JavaCC version the modifiers are kept outside in order to distinguish local classes from normal classes. In the new grammar we use the same production for both, but this can easily be changed if you want to maintain the same structure as before.

We can see more things, like the fact that classes extend one type and implement a type list. In the JavaCC grammar, TypedName is used for this. However, there is another production in the old grammar called Type, and because of that the rule names are a bit confusing, so we need to know what the rules are called in both grammars. In the new grammar, type can be a primitive type or a complex type created by the user (class, interface, enum or annotation) [Listing 4.17].

type
    :   classorinterfacetype ('[' ']')*
    |   primitivetype ('[' ']')*
    ;

createdname
    :   classorinterfacetype
    |   primitivetype
    ;

classorinterfacetype
    :   IDENTIFIER (typearguments)? ('.' IDENTIFIER (typearguments)?)*
    ;

primitivetype
    :   'boolean' | 'char' | 'byte' | 'short' | 'int' | 'long' | 'float' | 'double'
    ;

Listing 4.17: Type in the new grammar

In the old grammar [Listing 4.18], Type can be a TypedName, which corresponds to classorinterfacetype in the new one [Listing 4.17], or it can be a RawType. The created name is only used when we are creating a new array [Listing 4.23]. A RawType can be a PrimitiveType (this one is the same) or a Name. However, a Name is the same as a TypedName but without type arguments (which are optional in that production anyway). Name could be deleted, but as we can see in the comments it is also used in the import declarations, so that is not an option. If we deleted Name from RawType, we would probably obtain the same result, because we can obtain the type from TypedName, and the structure would be the same as in the new grammar but with different rule names.

Type() :
    TypedName() ( "[" "]" )*
  | RawType()

TypedName() :
    <IDENTIFIER> [TypeArguments()] ( "." <IDENTIFIER> [TypeArguments()] )*

RawType() :
    ( PrimitiveType() | Name() ) ( "[" "]" )*

PrimitiveType() :
    "boolean" ... "double"

UncollatedReferenceQualifier Name() :

/*
 * A lookahead of 2 is required below since Name can be followed
 * by a .* when used in the context of an ImportDeclaration.
 */
    <IDENTIFIER> ( "." <IDENTIFIER> )*

Listing 4.18: Types in the JavaCC grammar

Now we are going to talk about the members of classes and interfaces. The only significant variation here is that the methods and the constructors are in the same rule, unlike in the JavaCC grammar [Listing 4.19].

methoddeclaration
    :   /* For constructor */
        (modifiers)* (typeparameters)? IDENTIFIER formalparameters
        ('throws' qualifiednamelist)?
        '{' (explicitconstructorinvocation)? (blockstatement)* '}'
    |   /* For methods */
        (modifiers)* (typeparameters)? (type | 'void') IDENTIFIER formalparameters
        ('[' ']')* ('throws' qualifiednamelist)?
        ( bod = block | ';' )
    ;

Listing 4.19: Method and constructor declaration in one rule

In the new grammar there is a separate rule for the methods in an interface that does not allow a body in the method declarations. The old grammar allows the body to be declared; in the new grammar this is now checked syntactically. The only restriction in interfaces is that you cannot declare the body of the methods. The annotations also have another kind of method declaration, because a default value can be fixed for the methods.

There are a lot of small details that differ between the two grammars, but the expressions are exactly the same if we skip the primary expressions. Inside the primary expressions there are some differences. The first difference concerns the literals, more exactly the floating point literals. The old grammar has only one token for this kind of number and differentiates between double and float inside the production [Listing 4.20], whereas the new grammar has two different tokens for the two kinds of constants [Listing 4.21].

Literal Literal() :
    <FLOATING_POINT_LITERAL>
    {
        if (token.image.endsWith("f") || token.image.endsWith("F")) {
            result = factory.createFloatLiteral(token.image);
        } else {
            result = factory.createDoubleLiteral(token.image);
        }
        setPrefixInfo(result);
        ...
    }

Listing 4.20: Floating point literals in the JavaCC grammar

literal
    :   FLOATLITERAL  { result = factory.createFloatLiteral($FLOATLITERAL.getText()); }
    |   DOUBLELITERAL { result = factory.createDoubleLiteral($DOUBLELITERAL.getText()); }
    ;

Listing 4.21: Floating point literals in the ANTLR grammar

There is one big difference in the allocation expressions: they are organized in a totally different order. In the old grammar [Listing 4.22] there are two rules, but there is no separation between the allocation of arrays and the allocation of normal classes.

TypeOperator AllocationExpression() :
    (   "new" PrimitiveType() ArrayDimsAndInits()
    |   "new" TypedName() [NonWildcardTypeArguments()]
        ( Arguments() [ClassBody()]
        | ArrayDimsAndInits()
        )
    )

ArrayDimsAndInits() :
    (   ("[" Expression() "]")+ ( "[" "]" )*
    |   ( "[" "]" )+ ArrayInitializer()
    )

Listing 4.22: Allocation expression in the old grammar

In the new grammar we can see that the normal allocations and the array allocations are separated very clearly. The grammar is more ordered and easier for RECODER to use.

creator
    :   'new' classorinterfacetype (nonwildcardtypearguments)? arguments (classbody)?
    |   arraycreator
    ;

arraycreator
    :   'new' createdname ('[' ']')+ arrayinitializer
    |   'new' createdname ('[' expression ']')+ ('[' ']')*
    ;

Listing 4.23: Allocation expressions in the new grammar, called creator

4.2 Conclusion

In this section we have shown the changes that we introduced in the original ANTLR grammar to adapt it to the RECODER grammar specification. We have also compared the RECODER grammar with the new ANTLR grammar, discussing what has been taken out of the grammar and what has come into the new implementation.

5 Evaluation

In this section of the thesis we evaluate all the work done. In the first subsection we talk about the compatibility with the old version and some failures that have been encountered. In the second subsection we talk about the maintainability of the program, although it is difficult to measure.

5.1 Compatibility

RECODER incorporates a test suite for checking that everything is working properly and there are no failures. The suite contains 88 test cases that our implementation should pass. Unfortunately, our implementation does not pass all the tests, but on the other hand we know where the problems are, and they are actually all related to one single issue. RECODER keeps the position of the tokens by saving their start position, their end position and their position relative to the last token before the current one. This is not properly supported in the new ANTLR specification yet. This position handling is the cause of all the errors that we find in the test suite.

The first error appears in the ParserPrinterTest. This test parses a program, then unparses it and saves the code in a string. After that the code is parsed again and compared with the saved string. The result should be the same, but it is not, because the position of the tokens is not set correctly. The structure of the program is maintained, but it is not exactly the same because some white spaces are added; the test fails even though the code is the same.

The second test that fails is called TestAnalysisReport, and it fails for the same reason as before. The relative positions of the tokens are set incorrectly, and when the code is unparsed a lot of white spaces are inserted in the middle. This test compares whether the sizes of two buffers are equal, which is not the case because of the white spaces that are inserted between the tokens.

The third failure appears because ModelRebuildTest extends the TestAnalysisReport class. Because of that, the same failure appears again when the test calls its parent class and runs the test, which fails again for the same reason.

The comments in RECODER are attached to the element in the AST closest to the comment. That means that the comments are not elements by themselves but attributes associated with the elements in the AST. This causes some errors related to the comments: since the relative position is set incorrectly, the comments end up in the wrong position as well. The fourth failure happens with single line comments, for the reasons explained before, in the test TestSingleLineCommentBug. The last test that fails, TestCommentAttachment, shows the same error again for the same reason: the position is set incorrectly and the comments are inserted in the wrong position.
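To make the failure mode concrete, the following fragment sketches how such a relative position can be derived from two consecutive tokens, in the spirit of the shiftToken code shown in Listing 3.6. The Token fields used here follow JavaCC's Token class; the sketch is illustrative and is not RECODER's actual implementation.

// Relative position of 'token' with respect to the previous token 'prev':
// the number of line feeds between them and the column offset on the new line.
int lf  = token.beginLine - prev.endLine;   // lines skipped since the previous token
int col = token.beginColumn - 1;            // column, counted from 0
if (lf <= 0) {
    // same line: the column is relative to the end of the previous token
    col -= prev.endColumn;
    if (col < 0) {
        col = 0;
    }
}
position.setPosition(lf, col);              // stored as the element's relative position

If lf or col is computed wrongly, the unparser re-inserts a different amount of white space between the tokens, which is exactly what the failing tests observe.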


More information

Full file at

Full file at Java Programming: From Problem Analysis to Program Design, 3 rd Edition 2-1 Chapter 2 Basic Elements of Java At a Glance Instructor s Manual Table of Contents Overview Objectives s Quick Quizzes Class

More information

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised:

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised: EDAN65: Compilers, Lecture 06 A LR parsing Görel Hedin Revised: 2017-09-11 This lecture Regular expressions Context-free grammar Attribute grammar Lexical analyzer (scanner) Syntactic analyzer (parser)

More information

Syntax Errors; Static Semantics

Syntax Errors; Static Semantics Dealing with Syntax Errors Syntax Errors; Static Semantics Lecture 14 (from notes by R. Bodik) One purpose of the parser is to filter out errors that show up in parsing Later stages should not have to

More information

A Short Summary of Javali

A Short Summary of Javali A Short Summary of Javali October 15, 2015 1 Introduction Javali is a simple language based on ideas found in languages like C++ or Java. Its purpose is to serve as the source language for a simple compiler

More information

Grammars and Parsing. Paul Klint. Grammars and Parsing

Grammars and Parsing. Paul Klint. Grammars and Parsing Paul Klint Grammars and Languages are one of the most established areas of Natural Language Processing and Computer Science 2 N. Chomsky, Aspects of the theory of syntax, 1965 3 A Language...... is a (possibly

More information

CS 2210 Sample Midterm. 1. Determine if each of the following claims is true (T) or false (F).

CS 2210 Sample Midterm. 1. Determine if each of the following claims is true (T) or false (F). CS 2210 Sample Midterm 1. Determine if each of the following claims is true (T) or false (F). F A language consists of a set of strings, its grammar structure, and a set of operations. (Note: a language

More information

Outline. 1 Introduction. 2 Context-free Grammars and Languages. 3 Top-down Deterministic Parsing. 4 Bottom-up Deterministic Parsing

Outline. 1 Introduction. 2 Context-free Grammars and Languages. 3 Top-down Deterministic Parsing. 4 Bottom-up Deterministic Parsing Parsing 1 / 90 Outline 1 Introduction 2 Context-free Grammars and Languages 3 Top-down Deterministic Parsing 4 Bottom-up Deterministic Parsing 5 Parser Generation Using JavaCC 2 / 90 Introduction Once

More information

COMP 181 Compilers. Administrative. Last time. Prelude. Compilation strategy. Translation strategy. Lecture 2 Overview

COMP 181 Compilers. Administrative. Last time. Prelude. Compilation strategy. Translation strategy. Lecture 2 Overview COMP 181 Compilers Lecture 2 Overview September 7, 2006 Administrative Book? Hopefully: Compilers by Aho, Lam, Sethi, Ullman Mailing list Handouts? Programming assignments For next time, write a hello,

More information

CS664 Compiler Theory and Design LIU 1 of 16 ANTLR. Christopher League* 17 February Figure 1: ANTLR plugin installer

CS664 Compiler Theory and Design LIU 1 of 16 ANTLR. Christopher League* 17 February Figure 1: ANTLR plugin installer CS664 Compiler Theory and Design LIU 1 of 16 ANTLR Christopher League* 17 February 2016 ANTLR is a parser generator. There are other similar tools, such as yacc, flex, bison, etc. We ll be using ANTLR

More information

Weiss Chapter 1 terminology (parenthesized numbers are page numbers)

Weiss Chapter 1 terminology (parenthesized numbers are page numbers) Weiss Chapter 1 terminology (parenthesized numbers are page numbers) assignment operators In Java, used to alter the value of a variable. These operators include =, +=, -=, *=, and /=. (9) autoincrement

More information

Pace University. Fundamental Concepts of CS121 1

Pace University. Fundamental Concepts of CS121 1 Pace University Fundamental Concepts of CS121 1 Dr. Lixin Tao http://csis.pace.edu/~lixin Computer Science Department Pace University October 12, 2005 This document complements my tutorial Introduction

More information

Type Checking and Type Equality

Type Checking and Type Equality Type Checking and Type Equality Type systems are the biggest point of variation across programming languages. Even languages that look similar are often greatly different when it comes to their type systems.

More information

CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages CMSC 330: Organization of Programming Languages Context Free Grammars and Parsing 1 Recall: Architecture of Compilers, Interpreters Source Parser Static Analyzer Intermediate Representation Front End Back

More information

Intermediate Code Generation

Intermediate Code Generation Intermediate Code Generation In the analysis-synthesis model of a compiler, the front end analyzes a source program and creates an intermediate representation, from which the back end generates target

More information

Course Overview. Introduction (Chapter 1) Compiler Frontend: Today. Compiler Backend:

Course Overview. Introduction (Chapter 1) Compiler Frontend: Today. Compiler Backend: Course Overview Introduction (Chapter 1) Compiler Frontend: Today Lexical Analysis & Parsing (Chapter 2,3,4) Semantic Analysis (Chapter 5) Activation Records (Chapter 6) Translation to Intermediate Code

More information

Transformation of Java Card into Diet Java Card

Transformation of Java Card into Diet Java Card Semester Project Transformation of Java Card into Diet Java Card Erich Laube laubee@student.ethz.ch March 2005 Software Component Technology Group ETH Zurich Switzerland Prof. Peter Müller Supervisor:

More information

Wednesday, September 9, 15. Parsers

Wednesday, September 9, 15. Parsers Parsers What is a parser A parser has two jobs: 1) Determine whether a string (program) is valid (think: grammatically correct) 2) Determine the structure of a program (think: diagramming a sentence) Agenda

More information

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs:

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs: What is a parser Parsers A parser has two jobs: 1) Determine whether a string (program) is valid (think: grammatically correct) 2) Determine the structure of a program (think: diagramming a sentence) Agenda

More information

Chapter 3. Describing Syntax and Semantics

Chapter 3. Describing Syntax and Semantics Chapter 3 Describing Syntax and Semantics Chapter 3 Topics Introduction The General Problem of Describing Syntax Formal Methods of Describing Syntax Attribute Grammars Describing the Meanings of Programs:

More information

Part III : Parsing. From Regular to Context-Free Grammars. Deriving a Parser from a Context-Free Grammar. Scanners and Parsers.

Part III : Parsing. From Regular to Context-Free Grammars. Deriving a Parser from a Context-Free Grammar. Scanners and Parsers. Part III : Parsing From Regular to Context-Free Grammars Deriving a Parser from a Context-Free Grammar Scanners and Parsers A Parser for EBNF Left-Parsable Grammars Martin Odersky, LAMP/DI 1 From Regular

More information

CPS122 Lecture: From Python to Java last revised January 4, Objectives:

CPS122 Lecture: From Python to Java last revised January 4, Objectives: Objectives: CPS122 Lecture: From Python to Java last revised January 4, 2017 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

Decaf Language Reference Manual

Decaf Language Reference Manual Decaf Language Reference Manual C. R. Ramakrishnan Department of Computer Science SUNY at Stony Brook Stony Brook, NY 11794-4400 cram@cs.stonybrook.edu February 12, 2012 Decaf is a small object oriented

More information

1. Describe History of C++? 2. What is Dev. C++? 3. Why Use Dev. C++ instead of C++ DOS IDE?

1. Describe History of C++? 2. What is Dev. C++? 3. Why Use Dev. C++ instead of C++ DOS IDE? 1. Describe History of C++? The C++ programming language has a history going back to 1979, when Bjarne Stroustrup was doing work for his Ph.D. thesis. One of the languages Stroustrup had the opportunity

More information

Software Tools ANTLR

Software Tools ANTLR 2009 Software Tools ANTLR Part II - Lecture 5 1 The University of Auckland New Zealand COMPSCI 732 Today s Outline 2009 Introduction to ANTLR Parsing Actions Generators 2 The University of Auckland New

More information

Alternatives for semantic processing

Alternatives for semantic processing Semantic Processing Copyright c 2000 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies

More information

Object oriented programming. Instructor: Masoud Asghari Web page: Ch: 3

Object oriented programming. Instructor: Masoud Asghari Web page:   Ch: 3 Object oriented programming Instructor: Masoud Asghari Web page: http://www.masses.ir/lectures/oops2017sut Ch: 3 1 In this slide We follow: https://docs.oracle.com/javase/tutorial/index.html Trail: Learning

More information

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs Chapter 2 :: Programming Language Syntax Programming Language Pragmatics Michael L. Scott Introduction programming languages need to be precise natural languages less so both form (syntax) and meaning

More information

Semantic actions for declarations and expressions

Semantic actions for declarations and expressions Semantic actions for declarations and expressions Semantic actions Semantic actions are routines called as productions (or parts of productions) are recognized Actions work together to build up intermediate

More information

EDAN65: Compilers, Lecture 04 Grammar transformations: Eliminating ambiguities, adapting to LL parsing. Görel Hedin Revised:

EDAN65: Compilers, Lecture 04 Grammar transformations: Eliminating ambiguities, adapting to LL parsing. Görel Hedin Revised: EDAN65: Compilers, Lecture 04 Grammar transformations: Eliminating ambiguities, adapting to LL parsing Görel Hedin Revised: 2017-09-04 This lecture Regular expressions Context-free grammar Attribute grammar

More information

Programming Lecture 3

Programming Lecture 3 Programming Lecture 3 Expressions (Chapter 3) Primitive types Aside: Context Free Grammars Constants, variables Identifiers Variable declarations Arithmetic expressions Operator precedence Assignment statements

More information

What is a compiler? var a var b mov 3 a mov 4 r1 cmpi a r1 jge l_e mov 2 b jmp l_d l_e: mov 3 b l_d: ;done

What is a compiler? var a var b mov 3 a mov 4 r1 cmpi a r1 jge l_e mov 2 b jmp l_d l_e: mov 3 b l_d: ;done What is a compiler? What is a compiler? Traditionally: Program that analyzes and translates from a high level language (e.g., C++) to low-level assembly language that can be executed by hardware int a,

More information

Building Compilers with Phoenix

Building Compilers with Phoenix Building Compilers with Phoenix Syntax-Directed Translation Structure of a Compiler Character Stream Intermediate Representation Lexical Analyzer Machine-Independent Optimizer token stream Intermediate

More information

3. Java - Language Constructs I

3. Java - Language Constructs I Educational Objectives 3. Java - Language Constructs I Names and Identifiers, Variables, Assignments, Constants, Datatypes, Operations, Evaluation of Expressions, Type Conversions You know the basic blocks

More information

Programming Languages Third Edition

Programming Languages Third Edition Programming Languages Third Edition Chapter 12 Formal Semantics Objectives Become familiar with a sample small language for the purpose of semantic specification Understand operational semantics Understand

More information

Get JAVA. I will just tell you what I did (on January 10, 2017). I went to:

Get JAVA. I will just tell you what I did (on January 10, 2017). I went to: Get JAVA To compile programs you need the JDK (Java Development Kit). To RUN programs you need the JRE (Java Runtime Environment). This download will get BOTH of them, so that you will be able to both

More information

// the current object. functioninvocation expression. identifier (expressionlist ) // call of an inner function

// the current object. functioninvocation expression. identifier (expressionlist ) // call of an inner function SFU CMPT 379 Compilers Spring 2015 Assignment 4 Assignment due Thursday, April 9, by 11:59pm. For this assignment, you are to expand your Bunting-3 compiler from assignment 3 to handle Bunting-4. Project

More information

Module 10A Lecture - 20 What is a function? Why use functions Example: power (base, n)

Module 10A Lecture - 20 What is a function? Why use functions Example: power (base, n) Programming, Data Structures and Algorithms Prof. Shankar Balachandran Department of Computer Science and Engineering Indian Institute of Technology, Madras Module 10A Lecture - 20 What is a function?

More information

Scalify. Java -> Scala Source Translator Target: 100% of Java 1.5 Status: 90% of Java 1.4 Ninety/Ninety rule may apply

Scalify. Java -> Scala Source Translator Target: 100% of Java 1.5 Status: 90% of Java 1.4 Ninety/Ninety rule may apply Scalify Java -> Scala Source Translator Target: 100% of Java 1.5 Status: 90% of Java 1.4 Ninety/Ninety rule may apply http://github.com/paulp/scalify (BSD-like do-what-you-want license) Motivations Mixed

More information

Types, Values and Variables (Chapter 4, JLS)

Types, Values and Variables (Chapter 4, JLS) Lecture Notes CS 141 Winter 2005 Craig A. Rich Types, Values and Variables (Chapter 4, JLS) Primitive Types Values Representation boolean {false, true} 1-bit (possibly padded to 1 byte) Numeric Types Integral

More information

The following expression causes a divide by zero error:

The following expression causes a divide by zero error: Chapter 2 - Test Questions These test questions are true-false, fill in the blank, multiple choice, and free form questions that may require code. The multiple choice questions may have more than one correct

More information

Expressions and Data Types CSC 121 Fall 2015 Howard Rosenthal

Expressions and Data Types CSC 121 Fall 2015 Howard Rosenthal Expressions and Data Types CSC 121 Fall 2015 Howard Rosenthal Lesson Goals Understand the basic constructs of a Java Program Understand how to use basic identifiers Understand simple Java data types and

More information

Semantic actions for declarations and expressions

Semantic actions for declarations and expressions Semantic actions for declarations and expressions Semantic actions Semantic actions are routines called as productions (or parts of productions) are recognized Actions work together to build up intermediate

More information

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing Roadmap > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing The role of the parser > performs context-free syntax analysis > guides

More information

The Compiler So Far. CSC 4181 Compiler Construction. Semantic Analysis. Beyond Syntax. Goals of a Semantic Analyzer.

The Compiler So Far. CSC 4181 Compiler Construction. Semantic Analysis. Beyond Syntax. Goals of a Semantic Analyzer. The Compiler So Far CSC 4181 Compiler Construction Scanner - Lexical analysis Detects inputs with illegal tokens e.g.: main 5 (); Parser - Syntactic analysis Detects inputs with ill-formed parse trees

More information

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill Syntax Analysis Björn B. Brandenburg The University of North Carolina at Chapel Hill Based on slides and notes by S. Olivier, A. Block, N. Fisher, F. Hernandez-Campos, and D. Stotts. The Big Picture Character

More information

LL(k) Compiler Construction. Choice points in EBNF grammar. Left recursive grammar

LL(k) Compiler Construction. Choice points in EBNF grammar. Left recursive grammar LL(k) Compiler Construction More LL parsing Abstract syntax trees Lennart Andersson Revision 2012 01 31 2012 Related names top-down the parse tree is constructed top-down recursive descent if it is implemented

More information

Semantic actions for declarations and expressions. Monday, September 28, 15

Semantic actions for declarations and expressions. Monday, September 28, 15 Semantic actions for declarations and expressions Semantic actions Semantic actions are routines called as productions (or parts of productions) are recognized Actions work together to build up intermediate

More information

1 Lexical Considerations

1 Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Spring 2013 Handout Decaf Language Thursday, Feb 7 The project for the course is to write a compiler

More information

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Spring UW CSE P 501 Spring 2018 C-1

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Spring UW CSE P 501 Spring 2018 C-1 CSE P 501 Compilers Parsing & Context-Free Grammars Hal Perkins Spring 2018 UW CSE P 501 Spring 2018 C-1 Administrivia Project partner signup: please find a partner and fill out the signup form by noon

More information

CS 360 Programming Languages Interpreters

CS 360 Programming Languages Interpreters CS 360 Programming Languages Interpreters Implementing PLs Most of the course is learning fundamental concepts for using and understanding PLs. Syntax vs. semantics vs. idioms. Powerful constructs like

More information

6.001 Notes: Section 8.1

6.001 Notes: Section 8.1 6.001 Notes: Section 8.1 Slide 8.1.1 In this lecture we are going to introduce a new data type, specifically to deal with symbols. This may sound a bit odd, but if you step back, you may realize that everything

More information

LL(k) Compiler Construction. Top-down Parsing. LL(1) parsing engine. LL engine ID, $ S 0 E 1 T 2 3

LL(k) Compiler Construction. Top-down Parsing. LL(1) parsing engine. LL engine ID, $ S 0 E 1 T 2 3 LL(k) Compiler Construction More LL parsing Abstract syntax trees Lennart Andersson Revision 2011 01 31 2010 Related names top-down the parse tree is constructed top-down recursive descent if it is implemented

More information

Java Bytecode (binary file)

Java Bytecode (binary file) Java is Compiled Unlike Python, which is an interpreted langauge, Java code is compiled. In Java, a compiler reads in a Java source file (the code that we write), and it translates that code into bytecode.

More information

Syntactic Analysis. The Big Picture Again. Grammar. ICS312 Machine-Level and Systems Programming

Syntactic Analysis. The Big Picture Again. Grammar. ICS312 Machine-Level and Systems Programming The Big Picture Again Syntactic Analysis source code Scanner Parser Opt1 Opt2... Optn Instruction Selection Register Allocation Instruction Scheduling machine code ICS312 Machine-Level and Systems Programming

More information

Programming Languages Third Edition. Chapter 9 Control I Expressions and Statements

Programming Languages Third Edition. Chapter 9 Control I Expressions and Statements Programming Languages Third Edition Chapter 9 Control I Expressions and Statements Objectives Understand expressions Understand conditional statements and guards Understand loops and variation on WHILE

More information

The SPL Programming Language Reference Manual

The SPL Programming Language Reference Manual The SPL Programming Language Reference Manual Leonidas Fegaras University of Texas at Arlington Arlington, TX 76019 fegaras@cse.uta.edu February 27, 2018 1 Introduction The SPL language is a Small Programming

More information

JavaCC: SimpleExamples

JavaCC: SimpleExamples JavaCC: SimpleExamples This directory contains five examples to get you started using JavaCC. Each example is contained in a single grammar file and is listed below: (1) Simple1.jj, (2) Simple2.jj, (3)

More information

Parsing Combinators: Introduction & Tutorial

Parsing Combinators: Introduction & Tutorial Parsing Combinators: Introduction & Tutorial Mayer Goldberg October 21, 2017 Contents 1 Synopsis 1 2 Backus-Naur Form (BNF) 2 3 Parsing Combinators 3 4 Simple constructors 4 5 The parser stack 6 6 Recursive

More information

COP 3402 Systems Software Top Down Parsing (Recursive Descent)

COP 3402 Systems Software Top Down Parsing (Recursive Descent) COP 3402 Systems Software Top Down Parsing (Recursive Descent) Top Down Parsing 1 Outline 1. Top down parsing and LL(k) parsing 2. Recursive descent parsing 3. Example of recursive descent parsing of arithmetic

More information

Properties of Regular Expressions and Finite Automata

Properties of Regular Expressions and Finite Automata Properties of Regular Expressions and Finite Automata Some token patterns can t be defined as regular expressions or finite automata. Consider the set of balanced brackets of the form [[[ ]]]. This set

More information

PL Revision overview

PL Revision overview PL Revision overview Course topics Parsing G = (S, P, NT, T); (E)BNF; recursive descent predictive parser (RDPP) Lexical analysis; Syntax and semantic errors; type checking Programming language structure

More information

COMP 181. Agenda. Midterm topics. Today: type checking. Purpose of types. Type errors. Type checking

COMP 181. Agenda. Midterm topics. Today: type checking. Purpose of types. Type errors. Type checking Agenda COMP 181 Type checking October 21, 2009 Next week OOPSLA: Object-oriented Programming Systems Languages and Applications One of the top PL conferences Monday (Oct 26 th ) In-class midterm Review

More information

Syntax. Syntax. We will study three levels of syntax Lexical Defines the rules for tokens: literals, identifiers, etc.

Syntax. Syntax. We will study three levels of syntax Lexical Defines the rules for tokens: literals, identifiers, etc. Syntax Syntax Syntax defines what is grammatically valid in a programming language Set of grammatical rules E.g. in English, a sentence cannot begin with a period Must be formal and exact or there will

More information

Lexical Considerations

Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Spring 2010 Handout Decaf Language Tuesday, Feb 2 The project for the course is to write a compiler

More information

Context-free grammars (CFG s)

Context-free grammars (CFG s) Syntax Analysis/Parsing Purpose: determine if tokens have the right form for the language (right syntactic structure) stream of tokens abstract syntax tree (AST) AST: captures hierarchical structure of

More information

Frequently Asked Questions

Frequently Asked Questions Frequently Asked Questions This PowerTools FAQ answers many frequently asked questions regarding the functionality of the various parts of the PowerTools suite. The questions are organized in the following

More information

3.5 Practical Issues PRACTICAL ISSUES Error Recovery

3.5 Practical Issues PRACTICAL ISSUES Error Recovery 3.5 Practical Issues 141 3.5 PRACTICAL ISSUES Even with automatic parser generators, the compiler writer must manage several issues to produce a robust, efficient parser for a real programming language.

More information

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou Administrative! Any questions about the syllabus?! Course Material available at www.cs.unic.ac.cy/ioanna! Next time reading assignment [ALSU07]

More information

In Our Last Exciting Episode

In Our Last Exciting Episode In Our Last Exciting Episode #1 Lessons From Model Checking To find bugs, we need specifications What are some good specifications? To convert a program into a model, we need predicates/invariants and

More information

CPS122 Lecture: From Python to Java

CPS122 Lecture: From Python to Java Objectives: CPS122 Lecture: From Python to Java last revised January 7, 2013 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

A language is a subset of the set of all strings over some alphabet. string: a sequence of symbols alphabet: a set of symbols

A language is a subset of the set of all strings over some alphabet. string: a sequence of symbols alphabet: a set of symbols The current topic:! Introduction! Object-oriented programming: Python! Functional programming: Scheme! Python GUI programming (Tkinter)! Types and values! Logic programming: Prolog! Introduction! Rules,

More information

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen COP4020 Programming Languages Syntax Prof. Robert van Engelen Overview Tokens and regular expressions Syntax and context-free grammars Grammar derivations More about parse trees Top-down and bottom-up

More information

COMPILER CONSTRUCTION LAB 2 THE SYMBOL TABLE. Tutorial 2 LABS. PHASES OF A COMPILER Source Program. Lab 2 Symbol table

COMPILER CONSTRUCTION LAB 2 THE SYMBOL TABLE. Tutorial 2 LABS. PHASES OF A COMPILER Source Program. Lab 2 Symbol table COMPILER CONSTRUCTION Lab 2 Symbol table LABS Lab 3 LR parsing and abstract syntax tree construction using ''bison' Lab 4 Semantic analysis (type checking) PHASES OF A COMPILER Source Program Lab 2 Symtab

More information

CMPSCI 187 / Spring 2015 Postfix Expression Evaluator

CMPSCI 187 / Spring 2015 Postfix Expression Evaluator CMPSCI 187 / Spring 2015 Postfix Expression Evaluator Due on Thursday, 05 March, 8:30 a.m. Marc Liberatore and John Ridgway Morrill I N375 Section 01 @ 10:00 Section 02 @ 08:30 1 CMPSCI 187 / Spring 2015

More information

6.001 Notes: Section 15.1

6.001 Notes: Section 15.1 6.001 Notes: Section 15.1 Slide 15.1.1 Our goal over the next few lectures is to build an interpreter, which in a very basic sense is the ultimate in programming, since doing so will allow us to define

More information

CONTENTS: Array Usage Multi-Dimensional Arrays Reference Types. COMP-202 Unit 6: Arrays

CONTENTS: Array Usage Multi-Dimensional Arrays Reference Types. COMP-202 Unit 6: Arrays CONTENTS: Array Usage Multi-Dimensional Arrays Reference Types COMP-202 Unit 6: Arrays Introduction (1) Suppose you want to write a program that asks the user to enter the numeric final grades of 350 COMP-202

More information