Programming Assignment in Semantics of Programming Languages

The programming assignment (lab) is to implement a byte code interpreter and verifier for a simple object oriented byte code language which could be used as the target for a compiler of an object oriented language. [1] The assignment is inspired by the article Typing a Multi-Language Intermediate Code by Andrew D. Gordon and Don Syme about the byte code verifier for the intermediate language in Microsoft's .NET framework. The purpose of the assignment is to confront you with a formal description of a language whose semantics and type system are not immediately obvious from the syntax. It is assumed that you have knowledge of grammars (to understand the specification of the syntax) and of operational semantics definitions (to understand the semantics of the machine). The assignment is divided into two parts. Part I is the interpreter and part II is the byte code verifier.

1 A Byte Code Interpreter

1.1 The Language

The language is a simple object oriented language. A program consists of a set of declarations which in the concrete syntax are separated by semicolons.

\[ \text{Programs} \quad P ::= D_1 ; \ldots ; D_n \]

Each declaration $D_i$ is either a type signature or a class definition. The type signatures have no effect whatsoever on the execution of the program; they are there only to make it easier for you to implement the byte code verifier in part II. We will thus ignore them for the moment and move on to the class definitions. The syntax of the type signatures is given in Appendix.1 for those of you who want to implement your own parsers.

A class definition has the following syntax.

\[ \text{Class} \quad class ::= \mathtt{class}\ cname = fname_1 ; \ldots ; fname_n ; mname_1 = A_1 ; \ldots ; mname_n = A_n \]

where $cname$, $fname$ and $mname$ range over names of classes, fields and methods. The class definition specifies that an object of this class will have the fields $fname_1, \ldots, fname_n$.
Each of these fields can hold a 32-bit word, which may be considered as an integer or as an address. The class definition also specifies that an object of this class will have the methods $mname_1, \ldots, mname_n$ with method bodies $A_1, \ldots, A_n$ respectively. All the names of classes in a program must be distinct, and the field names and the method names must be distinct within each class. However, the same method names and field names may occur in several classes.

Every program must have a special class called Main with a field io and a method main. When a program is run, an object of class Main is created with the field io set to some input, and the method main is called on this object. When the program terminates, the value of io is considered to be the result of the program.

Based on a lab assignment originally written by Jörgen Gustavsson, 2002, for the course Semantics of Programming Languages. Modified by David Sands, 2012.

[1] A realistic byte code language for this purpose would have to be more complex.
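The naming conditions above can be checked before a program is run. The following is only a minimal sketch in Python, assuming a hypothetical in-memory representation in which a class is a triple of its name, its field names and its methods. The function name `check_names` and the error messages are illustrative and not part of the assignment.

```python
def check_names(classes):
    """Check the naming conditions on class definitions.

    `classes` is a list of (cname, field_names, methods) triples, where
    `methods` is a list of (mname, body) pairs. This layout is an assumption
    made for the sketch. Returns a list of error messages (empty if all is well).
    """
    def duplicates(names):
        # Names that occur more than once, in sorted order.
        return sorted({n for n in names if names.count(n) > 1})

    errors = []
    # Class names must be distinct within the program.
    for cname in duplicates([c[0] for c in classes]):
        errors.append(f"class {cname} is defined more than once")
    # Field names and method names must be distinct within each class.
    for cname, fnames, methods in classes:
        for f in duplicates(list(fnames)):
            errors.append(f"class {cname}: duplicate field {f}")
        for m in duplicates([m for m, _ in methods]):
            errors.append(f"class {cname}: duplicate method {m}")
    # There must be a class Main with a field io and a method main.
    mains = [c for c in classes if c[0] == "Main"]
    if not mains:
        errors.append("no class Main")
    else:
        _, fnames, methods = mains[0]
        if "io" not in fnames:
            errors.append("class Main has no field io")
        if "main" not in [m for m, _ in methods]:
            errors.append("class Main has no method main")
    return errors
```

Returning a list of messages rather than stopping at the first problem is a design choice of the sketch; it lets a front end report all naming errors at once.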

The body of a method consists of a sequence of byte code instructions. The form of these instructions is given by the following grammar.

\[
\begin{array}{llcl}
\text{Instr. seq.} & A, B & ::= & I_1 \ldots I_n \\
\text{Instructions} & I & ::= & l : \delta \to \delta' \mid \mathtt{ldc}\ n \mid \mathtt{discard}\ n\ m \mid \mathtt{ldstack}\ n \mid \mathtt{ststack}\ n \\
 & & \mid & \mathtt{ldfld}\ fname \mid \mathtt{stfld}\ fname \mid \mathtt{add} \mid \mathtt{sub} \mid \mathtt{equal} \mid \mathtt{leq} \\
 & & \mid & \mathtt{jump}\ l \mid \mathtt{brtrue}\ l \mid \mathtt{brfalse}\ l \mid \mathtt{newobj}\ cname \mid \mathtt{call}\ mname
\end{array}
\]

Note that instructions are very simple and that no instruction has sub instructions, in contrast to commands and expressions, which usually have sub commands and sub expressions. This makes it easy to code a sequence of instructions as a byte array, which is why it is called a byte code language. It also makes it easy to write a parser for the language, and the concrete syntax is the same as the abstract syntax. Note that there is no return instruction. Instead a method returns when it reaches the end of the instruction sequence.

Instructions of the form $l : \delta \to \delta'$ are called labels. They are not instructions in the normal sense: when execution passes a label, nothing happens. Labels are instead used to mark points in the instruction sequence which you can jump to with a jump or branch instruction. All the labels within a method must be distinct, and it is not possible to jump to a label which occurs in another method. All labels are annotated with a type $\delta \to \delta'$ which is there to make it easier for you to implement the byte code verifier. The annotations play no rôle in the execution of the program so you may ignore them for the moment.

Most instructions operate on the stack. For example the instruction add will take two elements from the stack, add them and store the result on top of the stack. The stack is also used for local variables, for parameters to methods and for results from methods. We will use $S$ to range over stacks, $\epsilon$ to denote an empty stack and $n \cdot S$ to mean that $n$ is on top of $S$. The stack contains 32-bit words, which may be considered as integers or as addresses.
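The two conditions on labels (distinct within a method, and jump targets local to the method) are easy to check on a parsed method body. The following is a minimal sketch in Python, assuming that instructions are encoded as tuples such as `("label", "l0")`, `("jump", "l0")` or `("add",)`, with the type annotations on labels left out. The encoding and the name `check_labels` are assumptions made for illustration.

```python
def check_labels(body):
    """Check the label conditions for one method body.

    `body` is a list of instruction tuples whose first component is the
    instruction name (an assumed encoding). Returns a list of error messages.
    """
    errors = []
    seen = set()
    # First pass: collect the labels and report duplicates.
    for instr in body:
        if instr[0] == "label":
            if instr[1] in seen:
                errors.append(f"duplicate label {instr[1]}")
            seen.add(instr[1])
    # Second pass: every jump or branch must target a label of this method.
    for instr in body:
        if instr[0] in ("jump", "brtrue", "brfalse") and instr[1] not in seen:
            errors.append(f"{instr[0]} to unknown label {instr[1]}")
    return errors
```

Two passes are used so that a branch may refer to a label that occurs later in the same body.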
We will write $|S|$ for the number of elements in a stack and $S(n)$ for the $n$th element on the stack. We number the elements from the top downwards, starting from 0. So $S(0)$ is the top of the stack, $S(1)$ is the element below, and so on. Similarly we write $S[n \mapsto m]$ for updating the $n$th element of $S$ with $m$.

Conceptually, an object of a class is a record with fields and methods as specified by the class definition. The fields of an object may be updated but the methods may not be changed by the execution, so we can represent an object as a pair of the name of its class and its fields as follows.

\[ \text{Objects} \quad O ::= (cname, fields) \]

When a method is called on an object we use the class name to get hold of the right method body. We model the fields of an object by a partial function from the appropriate field names to 32-bit words. When an object is created, space is allocated on the heap, which we model as a partial function from 32-bit words to objects. We use $H$ to range over heaps. Objects are passed by reference, i.e., if we for instance want to pass an object to a method we store the 32-bit address of the object on the stack before we call the method.

Configurations are triples $\langle A, S, H \rangle$ of an instruction sequence, a stack and a heap. We will use $C$ to range over configurations. The judgements of our big-step semantics are of the form

\[ P, B \vdash C \Downarrow C' \]

where $P$ is the program (i.e., a set of declarations), $B$ is the body of the method that is currently executing, $C$ is the initial configuration and $C'$ is the terminal configuration. The program $P$ is used to access the class definitions, which are needed when a new object is created and when a method is called. The body $B$ of the currently executing method is used when a jump or branch instruction is executed. There we use the notation $B(l)$ to denote the instruction sequence starting directly after the label $l$. The initial configuration when we run a program with input $n$ consists of a three-instruction sequence, the empty stack and the empty heap:

\[ \langle \mathtt{ldc}\ n\ \ \mathtt{newobj}\ \mathtt{Main}\ \ \mathtt{call}\ \mathtt{main},\ \epsilon,\ \emptyset \rangle \]
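To make the notation concrete, here is a minimal sketch in Python of a machine that executes configurations. Everything about the encoding is an assumption made for the sketch: instructions are tuples such as `("ldc", 5)` or `("label", "l0")`, stacks are lists with the top at index 0 (so $S(i)$ is `stack[i]`), heaps are dictionaries from addresses to `(cname, fields)` pairs, and the program is a dictionary from class names to their field lists and method bodies. The sketch uses a loop where the big-step rules use derivations, picks the next free integer as a fresh address, does not model 32-bit wrap-around, and raises a Python exception wherever no rule applies.

```python
def after_label(body, label):
    """B(l): the instruction sequence directly after label l in body B."""
    for pos, instr in enumerate(body):
        if instr[0] == "label" and instr[1] == label:
            return body[pos + 1:]
    raise RuntimeError(f"no label {label} in the current method")


def run(program, body, code, stack, heap):
    """Execute <code, stack, heap> inside method body `body`; return (stack, heap)."""
    code, stack, heap = list(code), list(stack), dict(heap)
    while code:
        instr, code = code[0], code[1:]
        op = instr[0]
        if op == "label":
            continue                                  # passing a label does nothing
        elif op == "ldc":
            stack = [instr[1]] + stack
        elif op == "discard":                         # keep the top i, drop the next j
            i, j = instr[1], instr[2]
            if len(stack) < i + j:
                raise RuntimeError("discard: stack too short")
            stack = stack[:i] + stack[i + j:]
        elif op == "ldstack":                         # push a copy of S(i)
            stack = [stack[instr[1]]] + stack
        elif op == "ststack":                         # pop n, then S[i -> n]
            n, rest = stack[0], stack[1:]
            rest[instr[1]] = n
            stack = rest
        elif op in ("add", "sub", "equal", "leq"):    # n1 is on top, n0 below it
            n1, n0, rest = stack[0], stack[1], stack[2:]
            n2 = {"add": n0 + n1, "sub": n0 - n1,
                  "equal": int(n0 == n1), "leq": int(n0 <= n1)}[op]
            stack = [n2] + rest
        elif op == "jump":
            code = after_label(body, instr[1])
        elif op in ("brtrue", "brfalse"):
            n, stack = stack[0], stack[1:]
            if (n != 0) == (op == "brtrue"):
                code = after_label(body, instr[1])
        elif op == "newobj":
            fnames = program[instr[1]]["fields"]
            if len(stack) < len(fnames):
                raise RuntimeError("newobj: stack too short")
            values, stack = stack[:len(fnames)], stack[len(fnames):]
            addr = len(heap) + 1                      # assumed allocation scheme
            heap[addr] = (instr[1], dict(zip(fnames, values)))
            stack = [addr] + stack
        elif op == "ldfld":
            n, stack = stack[0], stack[1:]
            stack = [heap[n][1][instr[1]]] + stack
        elif op == "stfld":                           # reference m on top, value n below
            m, n, stack = stack[0], stack[1], stack[2:]
            cname, fields = heap[m]
            if instr[1] not in fields:
                raise RuntimeError(f"no field {instr[1]} in object of class {cname}")
            heap[m] = (cname, {**fields, instr[1]: n})
        elif op == "call":                            # the reference stays on the stack
            cname, _ = heap[stack[0]]
            callee = program[cname]["methods"][instr[1]]
            stack, heap = run(program, callee, callee, stack, heap)
        else:
            raise RuntimeError(f"unknown instruction {op}")
    return stack, heap


def run_program(program, n):
    """Run the initial configuration for input n and return the final value of io."""
    stack, heap = run(program, [], [("ldc", n), ("newobj", "Main")], [], {})
    main_addr = stack[0]                              # address of the Main object
    stack, heap = run(program, [], [("call", "main")], stack, heap)
    return heap[main_addr][1]["io"]
```

For instance, a `main` body consisting of `ldstack 0`, `ldfld io`, `ldc 1`, `add`, `ldstack 1`, `stfld io`, `discard 0 1` increments io by one and leaves the stack empty. Splitting the initial instruction sequence in two in `run_program` is only a convenience for remembering the address of the Main object.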

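The instructions in the initial configuration create an object of class Main and then call the method main on it. The rules defining $P, B \vdash C \Downarrow C'$ are given in Figures 1, 2 and 3.

\[
\text{(ldc)}\quad
\frac{P, B \vdash \langle A,\ n \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{ldc}\ n\ A,\ S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\]

\[
\text{(discard)}\quad
\frac{P, B \vdash \langle A,\ n_1 \cdot \ldots \cdot n_i \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{discard}\ i\ j\ A,\ n_1 \cdot \ldots \cdot n_i \cdot m_1 \cdot \ldots \cdot m_j \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\]

\[
\text{(ldstack)}\quad
\frac{P, B \vdash \langle A,\ n \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{ldstack}\ i\ A,\ S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{if } |S| > i \text{ and } S(i) = n
\]

\[
\text{(ststack)}\quad
\frac{P, B \vdash \langle A,\ S[i \mapsto n],\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{ststack}\ i\ A,\ n \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{if } |S| > i
\]

\[
\text{(add)}\quad
\frac{P, B \vdash \langle A,\ n_2 \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{add}\ A,\ n_1 \cdot n_0 \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{if } n_2 = n_0 + n_1
\]

\[
\text{(sub)}\quad
\frac{P, B \vdash \langle A,\ n_2 \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{sub}\ A,\ n_1 \cdot n_0 \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{if } n_2 = n_0 - n_1
\]

\[
\text{(equal)}\quad
\frac{P, B \vdash \langle A,\ n_2 \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{equal}\ A,\ n_1 \cdot n_0 \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{where } n_2 = \begin{cases} 1 & \text{if } n_0 = n_1 \\ 0 & \text{otherwise} \end{cases}
\]

\[
\text{(leq)}\quad
\frac{P, B \vdash \langle A,\ n_2 \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{leq}\ A,\ n_1 \cdot n_0 \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{where } n_2 = \begin{cases} 1 & \text{if } n_0 \le n_1 \\ 0 & \text{otherwise} \end{cases}
\]

Figure 1: Evaluation rules for stack manipulation.

\[
\text{($\epsilon$)}\quad
\frac{}{P, B \vdash \langle \epsilon, S, H \rangle \Downarrow \langle \epsilon, S, H \rangle}
\]

\[
\text{(label)}\quad
\frac{P, B \vdash \langle A, S, H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle l : \delta \to \delta'\ A,\ S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\]

\[
\text{(jump)}\quad
\frac{P, B \vdash \langle A', S, H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{jump}\ l\ A,\ S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{if } l \in dom(B) \text{ and } B(l) = A'
\]

\[
\text{(brtrue)}\quad
\frac{P, B \vdash \langle A', S, H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{brtrue}\ l\ A,\ n \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{if } l \in dom(B) \text{ and } A' = \begin{cases} B(l) & \text{if } n \ne 0 \\ A & \text{otherwise} \end{cases}
\]

\[
\text{(brfalse)}\quad
\frac{P, B \vdash \langle A', S, H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{brfalse}\ l\ A,\ n \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{if } l \in dom(B) \text{ and } A' = \begin{cases} B(l) & \text{if } n = 0 \\ A & \text{otherwise} \end{cases}
\]

\[
\text{(call)}\quad
\frac{P, A_j \vdash \langle A_j,\ n \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle
      \qquad
      P, B \vdash \langle A, S', H' \rangle \Downarrow \langle \epsilon, S'', H'' \rangle}
     {P, B \vdash \langle \mathtt{call}\ mname\ A,\ n \cdot S,\ H \rangle \Downarrow \langle \epsilon, S'', H'' \rangle}
\]
\[
\text{if } n \in dom(H),\ H(n) = (cname, fields),\ (\mathtt{class}\ cname = fnames ; mname_1 = A_1 ; \ldots ; mname_i = A_i) \in P \text{ and } mname = mname_j
\]

Figure 2: Evaluation rules for control flow.

\[
\text{(ldfld)}\quad
\frac{P, B \vdash \langle A,\ m \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{ldfld}\ fname\ A,\ n \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{if } n \in dom(H),\ H(n) = (cname, fields) \text{ and } fields(fname) = m
\]

\[
\text{(stfld)}\quad
\frac{P, B \vdash \langle A,\ S,\ H[m \mapsto (cname, fields[fname \mapsto n])] \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{stfld}\ fname\ A,\ m \cdot n \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\quad \text{if } m \in dom(H) \text{ and } H(m) = (cname, fields)
\]

\[
\text{(newobj)}\quad
\frac{P, B \vdash \langle A,\ m \cdot S,\ H[m \mapsto (cname, fields)] \rangle \Downarrow \langle \epsilon, S', H' \rangle}
     {P, B \vdash \langle \mathtt{newobj}\ cname\ A,\ n_1 \cdot \ldots \cdot n_i \cdot S,\ H \rangle \Downarrow \langle \epsilon, S', H' \rangle}
\]
\[
\text{if } m \notin dom(H),\ (\mathtt{class}\ cname = fname_1 ; \ldots ; fname_i ; methods) \in P \text{ and } fields = [fname_1 \mapsto n_1, \ldots, fname_i \mapsto n_i]
\]

Figure 3: Evaluation rules for object manipulation.

2 A Byte Code Verifier

Your task in part II of the computer assignment is to implement a byte code verifier. The language in question is the very same as in part I. The verifier is based on a simple type system for the byte code. The verifier should accept a program if and only if it is well typed. If a program is ill typed then the verifier should give an appropriate error message.

If a program passes the verifier then it is guaranteed (unless I have made some mistake) not to cause certain run time errors. Among other things it is supposed to guarantee that an integer is never used as a reference to an object in the heap, that references to objects are not used in arithmetic operations, and that there are always enough elements on the stack to execute instructions that manipulate the stack. However, even if a program passes the verifier, the instruction call mname may cause a run time error because there is no method with the name mname in the object that the instruction operates on, and the instructions ldfld fname and stfld fname may cause a run time error because there is no field with the name fname in the object that the instructions operate on. If you want an additional challenge you may extend the type system so that it prevents the second kind of error. If you want to do that then pass by my office and I will give you a hint on how to do it.

Contact me when you are ready with the assignment so that we can agree on a time for a meeting where you can demonstrate to me what you have done. I have made some test cases available from the home page which you can use to test your verifier. I would like you to finish both parts of the computer assignment at the latest in the first week of the next study period. However, as you might guess, I would recommend you to do it earlier.

2.1 The Type System

2.2 Types

At the core of the type system are the two base types, int and object. These are used to give a type to the individual fields in an object and to the values stored on the stack. The type of a stack takes the following form.

\[
\begin{array}{llcl}
\text{Base Types} & \rho & ::= & \mathtt{int} \mid \mathtt{object} \\
\text{Stack Types} & \delta & ::= & \rho_1 \cdot \ldots \cdot \rho_n \quad \text{where } n \ge 0
\end{array}
\]

The idea is that a stack with the type $\rho_1 \cdot \ldots \cdot \rho_n$ has at least $n$ elements, that the top of the stack has base type $\rho_1$, that the second element has base type $\rho_2$, and so on. We will write $\delta_0 \cdot \delta_1$ for the concatenation of two stack types, $|\delta|$ for the length of the stack type, $\delta(i)$ for the $i$th element and $\delta[i \mapsto \rho]$ for the stack type where the $i$th base type of $\delta$ has been updated with $\rho$. The operations are defined as follows.

\[
\begin{array}{rcll}
(\rho_1 \cdot \ldots \cdot \rho_n) \cdot (\rho_{n+1} \cdot \ldots \cdot \rho_{n+m}) & = & \rho_1 \cdot \ldots \cdot \rho_{n+m} & \\
|\rho_1 \cdot \ldots \cdot \rho_n| & = & n & \\
(\rho_1 \cdot \ldots \cdot \rho_n)(i) & = & \rho_{i+1} & \text{if } 0 \le i \le n - 1 \\
(\rho_1 \cdot \ldots \cdot \rho_{i+1} \cdot \ldots \cdot \rho_n)[i \mapsto \rho] & = & \rho_1 \cdot \ldots \cdot \rho \cdot \ldots \cdot \rho_n & \text{if } 0 \le i \le n - 1
\end{array}
\]

Note that $\delta(0)$ refers to the first base type, $\delta(1)$ to the second and so on. We give a sequence of byte code instructions a type of the form $\delta \to \delta'$.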
The intuitive meaning of the type is that the instruction sequence requires the stack to have type $\delta$ before the sequence is executed and, if that is the case, the stack has type $\delta'$ after the sequence has been executed. For example the instruction sequence

\[ \mathtt{add}\ \ \mathtt{add} \]

has the type

\[ \mathtt{int} \cdot \mathtt{int} \cdot \mathtt{int} \cdot \mathtt{nil} \to \mathtt{int} \cdot \mathtt{nil} \]

because it requires that there are at least three integers on top of the stack, and after its execution the three integers have been replaced with one integer.

2.3 Type Signatures and Annotations

The byte code is augmented with type signatures and type annotations to make it easier for you to implement the verifier. If you want a challenge then you may try to ignore them and infer them automatically.

There are two kinds of type signatures. Field signatures take the form

\[ \mathtt{field}\ fname : \rho \]

and specify that the field $fname$ has the base type $\rho$, i.e., whether the field stores an integer or a reference to an object. A field signature is a top level declaration and it applies to all classes. Method signatures take the form

\[ \mathtt{method}\ mname : \tau \]

and specify that the method $mname$ has the method type $\tau$. Method signatures are also top level declarations and apply to all classes. Method types are of the following form.

\[ \text{Method Types} \quad \tau ::= \rho \to \tau \mid (\rho_1, \ldots, \rho_n) \quad \text{where } n \ge 0 \]

In the concrete syntax one may omit the parentheses in $(\rho_1, \ldots, \rho_n)$ if $n = 1$, i.e., one may write $\rho$ instead of $(\rho)$. Finally, labels in the instruction sequence are annotated with a type $\delta \to \delta'$ which is the type of the instruction sequence following the label. The entire syntax of the language is given for reference in Appendix.1.

2.4 Typing Rules

The type system has two kinds of typing judgements, one for typing instruction sequences and one for typing methods.

Typing Instruction Sequences

The typing judgements for instruction sequences take the form

\[ \Gamma, \Delta \vdash A : \delta \to \delta' \]

and should be read as "the instruction sequence $A$ has the type $\delta \to \delta'$ in the context of $\Gamma, \Delta$". The context $\Gamma$ contains global typing information about the types of all fields and methods. For each class in the program it also contains a tuple of base types, which are the types of the fields of the class. Formally we model $\Gamma$ as a partial function which maps field names to base types, method names to method types and class names to tuples of base types. The context $\Delta$ contains local typing information about the labels in the method currently being typed. Formally it is a partial function from label names to types of the form $\delta \to \delta'$. The information in $\Delta$ originates from the type annotations on labels in the input program. The typing rules for instruction sequences are given in Figure 4, Figure 5 and Figure 6 and we leave them without further comments.
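As an illustration of how typing judgements for label-free code can be checked, the following Python sketch computes the stack type after an instruction sequence from the stack type before it. It is a sketch under stated assumptions and not the verifier itself: it ignores labels, jumps and branches (and therefore $\Delta$), it encodes stack types as tuples of the strings `"int"` and `"object"` with the top first, and it encodes $\Gamma$ as a dictionary with the hypothetical keys `"fields"`, `"methods"` (mapping a method name to a pair of argument and result tuples) and `"classes"`. The name `type_after` is likewise made up for the example.

```python
def type_after(gamma, code, delta):
    """Return the stack type after `code`, given the stack type `delta` before it.

    Raises TypeError if an instruction does not fit the current stack type.
    Only label-free code is handled in this sketch.
    """
    delta = tuple(delta)

    def need(prefix, what):
        # Check that the current stack type starts with `prefix`; return the rest.
        prefix = tuple(prefix)
        if delta[:len(prefix)] != prefix:
            raise TypeError(f"{what}: expected {prefix} on top of {delta}")
        return delta[len(prefix):]

    for instr in code:
        op = instr[0]
        if op == "ldc":
            delta = ("int",) + delta
        elif op in ("add", "sub", "equal", "leq"):
            delta = ("int",) + need(("int", "int"), op)
        elif op == "discard":                       # keep the top i, drop the next j
            i, j = instr[1], instr[2]
            if len(delta) < i + j:
                raise TypeError(f"discard: stack type {delta} too short")
            delta = delta[:i] + delta[i + j:]
        elif op == "ldstack":
            i = instr[1]
            if not len(delta) > i:
                raise TypeError(f"ldstack: stack type {delta} too short")
            delta = (delta[i],) + delta
        elif op == "ststack":                       # rho on top of a type longer than i
            i = instr[1]
            if len(delta) < i + 2:
                raise TypeError(f"ststack: stack type {delta} too short")
            rho, rest = delta[0], delta[1:]
            delta = rest[:i] + (rho,) + rest[i + 1:]
        elif op == "ldfld":
            delta = (gamma["fields"][instr[1]],) + need(("object",), op)
        elif op == "stfld":                         # reference on top, value below
            delta = need(("object", gamma["fields"][instr[1]]), op)
        elif op == "newobj":
            delta = ("object",) + need(gamma["classes"][instr[1]], op)
        elif op == "call":
            args, results = gamma["methods"][instr[1]]
            delta = tuple(results) + need(("object",) + tuple(args), op)
        else:
            raise NotImplementedError(f"{op} is not covered by this sketch")
    return delta
```

With this encoding, `type_after({}, [("add",), ("add",)], ("int", "int", "int"))` yields `("int",)`. Computing the result type forwards is a choice made for the sketch; for code with labels, the annotations in $\Delta$ would have to be consulted at every label, jump and branch.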
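\[
\text{($\vdash$: ldc)}\quad
\frac{\Gamma, \Delta \vdash A : \mathtt{int} \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{ldc}\ n\ A : \delta \to \delta'}
\]

\[
\text{($\vdash$: discard)}\quad
\frac{\Gamma, \Delta \vdash A : \rho_1 \cdot \ldots \cdot \rho_i \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{discard}\ i\ j\ A : \rho_1 \cdot \ldots \cdot \rho_i \cdot \rho_{i+1} \cdot \ldots \cdot \rho_{i+j} \cdot \delta \to \delta'}
\]

\[
\text{($\vdash$: ldstack)}\quad
\frac{\Gamma, \Delta \vdash A : \rho \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{ldstack}\ i\ A : \delta \to \delta'}
\quad \text{if } |\delta| > i \text{ and } \delta(i) = \rho
\]

\[
\text{($\vdash$: ststack)}\quad
\frac{\Gamma, \Delta \vdash A : \delta[i \mapsto \rho] \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{ststack}\ i\ A : \rho \cdot \delta \to \delta'}
\quad \text{if } |\delta| > i
\]

\[
\text{($\vdash$: add)}\quad
\frac{\Gamma, \Delta \vdash A : \mathtt{int} \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{add}\ A : \mathtt{int} \cdot \mathtt{int} \cdot \delta \to \delta'}
\qquad
\text{($\vdash$: sub)}\quad
\frac{\Gamma, \Delta \vdash A : \mathtt{int} \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{sub}\ A : \mathtt{int} \cdot \mathtt{int} \cdot \delta \to \delta'}
\]

\[
\text{($\vdash$: equal)}\quad
\frac{\Gamma, \Delta \vdash A : \mathtt{int} \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{equal}\ A : \mathtt{int} \cdot \mathtt{int} \cdot \delta \to \delta'}
\qquad
\text{($\vdash$: leq)}\quad
\frac{\Gamma, \Delta \vdash A : \mathtt{int} \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{leq}\ A : \mathtt{int} \cdot \mathtt{int} \cdot \delta \to \delta'}
\]

Figure 4: Typing rules for stack manipulating instructions.

\[
\text{($\vdash$: $\epsilon$)}\quad
\frac{}{\Gamma, \Delta \vdash \epsilon : \delta \to \delta}
\]

\[
\text{($\vdash$: label)}\quad
\frac{}{\Gamma, \Delta \vdash l : \delta \to \delta'\ A : \delta \to \delta'}
\quad \text{if } \Delta(l) = \delta \to \delta'
\]

\[
\text{($\vdash$: jump)}\quad
\frac{}{\Gamma, \Delta \vdash \mathtt{jump}\ l\ A : \delta \to \delta'}
\quad \text{if } \Delta(l) = \delta \to \delta'
\]

\[
\text{($\vdash$: brtrue)}\quad
\frac{\Gamma, \Delta \vdash A : \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{brtrue}\ l\ A : \mathtt{int} \cdot \delta \to \delta'}
\quad \text{if } \Delta(l) = \delta \to \delta'
\]

\[
\text{($\vdash$: brfalse)}\quad
\frac{\Gamma, \Delta \vdash A : \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{brfalse}\ l\ A : \mathtt{int} \cdot \delta \to \delta'}
\quad \text{if } \Delta(l) = \delta \to \delta'
\]

\[
\text{($\vdash$: call)}\quad
\frac{\Gamma, \Delta \vdash A : \rho'_1 \cdot \ldots \cdot \rho'_m \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{call}\ mname\ A : \mathtt{object} \cdot \rho_1 \cdot \ldots \cdot \rho_n \cdot \delta \to \delta'}
\quad \text{if } \Gamma(mname) = \rho_1 \to \cdots \to \rho_n \to (\rho'_1, \ldots, \rho'_m)
\]

Figure 5: Typing rules for control flow instructions.

\[
\text{($\vdash$: ldfld)}\quad
\frac{\Gamma, \Delta \vdash A : \rho \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{ldfld}\ fname\ A : \mathtt{object} \cdot \delta \to \delta'}
\quad \text{if } \Gamma(fname) = \rho
\]

\[
\text{($\vdash$: stfld)}\quad
\frac{\Gamma, \Delta \vdash A : \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{stfld}\ fname\ A : \mathtt{object} \cdot \rho \cdot \delta \to \delta'}
\quad \text{if } \Gamma(fname) = \rho
\]

\[
\text{($\vdash$: newobj)}\quad
\frac{\Gamma, \Delta \vdash A : \mathtt{object} \cdot \delta \to \delta'}
     {\Gamma, \Delta \vdash \mathtt{newobj}\ cname\ A : \rho_1 \cdot \ldots \cdot \rho_n \cdot \delta \to \delta'}
\quad \text{if } \Gamma(cname) = (\rho_1, \ldots, \rho_n)
\]

Figure 6: Typing rules for object manipulating instructions.

Typing Methods

The second kind of typing judgements are for typing methods. The judgements take the form

\[ \Gamma \vdash mname = A\ \ \mathsf{wt} \]

and should be read as "the method $mname$ with the body $A$ is well typed in the context $\Gamma$". The only rule ($\vdash$: method) for typing methods is as follows.

\[
\frac{\Gamma, \Delta \vdash A : \mathtt{object} \cdot \rho_1 \cdot \ldots \cdot \rho_n \to \rho'_1 \cdot \ldots \cdot \rho'_m
      \qquad
      \forall l \in dom(A).\ \Gamma, \Delta \vdash A(l) : \Delta(l)}
     {\Gamma \vdash mname = A\ \ \mathsf{wt}}
\]
\[
\text{if } \Gamma(mname) = \rho_1 \to \cdots \to \rho_n \to (\rho'_1, \ldots, \rho'_m) \text{ and } \forall l \in dom(A).\ \Delta(l) = typeat(A, l)
\]

The first premise of the rule specifies that for a method to be well typed, its body has to have a type which corresponds to the type that has been declared for the method. I.e., if the type of the method is declared to be $\rho_1 \to \cdots \to \rho_n \to (\rho'_1, \ldots, \rho'_m)$ then the instruction sequence in the body has to have the type $\mathtt{object} \cdot \rho_1 \cdot \ldots \cdot \rho_n \to \rho'_1 \cdot \ldots \cdot \rho'_m$. Note that we require that the body expects an object on top of the stack. This is because when a method is called there is always a reference to the concerned object on top of the stack. The second premise of the rule specifies that for each label in the method, the instruction sequence following the label should have the type that the label is annotated with. There we use the notation $typeat(A, l)$ to denote the type annotation on $l$. We also use $dom(A)$ for the set of labels in $A$ and $A(l)$ for the instruction sequence that follows after $l$.

We are now ready to specify what is required for a whole program to be well typed.

Definition 1 A program $P$ is well typed if and only if the field io is declared to have type int, the method main is declared to have type $()$, the class Main is declared to have exactly one field io and exactly one method main, and all methods in all classes are well typed in a context $\Gamma$ such that

\[
\begin{array}{lcl}
\Gamma(fname) = \rho & \text{if and only if} & \mathtt{field}\ fname : \rho \in P \\
\Gamma(mname) = \tau & \text{if and only if} & \mathtt{method}\ mname : \tau \in P \\
\Gamma(cname) = (\rho_1, \ldots, \rho_n) & \text{if and only if} & (\mathtt{class}\ cname = fname_1 ; \ldots ; fname_n ; methods) \in P \\
 & & \text{and } \forall i \in \{1, \ldots, n\}.\ \mathtt{field}\ fname_i : \rho_i \in P
\end{array}
\]

.1 Full Syntax

The entire syntax of the language, including type signatures, is given in this appendix. The rôle of the type signatures will be explained in the instructions for part II of the computer assignment. What is given below is the abstract syntax, but the concrete syntax of the language differs from it in only two aspects. Firstly, the symbol $\cdot$ in the abstract syntax is replaced with a . in the concrete syntax. Secondly, the syntax specifies that a method type may be of the form $(\rho_1, \ldots, \rho_n)$. If $n = 1$ then the concrete syntax allows us to omit the parentheses and write $\rho$ instead of $(\rho)$.

\[
\begin{array}{llcl}
\text{Base Types} & \rho & ::= & \mathtt{int} \mid \mathtt{object} \\
\text{Method Types} & \tau & ::= & \rho \to \tau \mid (\rho_1, \ldots, \rho_n) \quad \text{where } n \ge 0 \\
\text{Stack Types} & \delta & ::= & \rho_1 \cdot \ldots \cdot \rho_n \quad \text{where } n \ge 0 \\
\text{Programs} & P & ::= & D_1 ; \ldots ; D_n \\
\text{Declarations} & D & ::= & \mathtt{method}\ mname : \tau \mid \mathtt{field}\ fname : \rho \\
 & & \mid & \mathtt{class}\ cname = fname_1 ; \ldots ; fname_n ; mname_1 = A_1 ; \ldots ; mname_n = A_n \\
\text{Instr. seq.} & A, B & ::= & I_1 \ldots I_n \\
\text{Instructions} & I & ::= & l : \delta \to \delta' \mid \mathtt{ldc}\ n \mid \mathtt{discard}\ n\ m \mid \mathtt{ldstack}\ n \mid \mathtt{ststack}\ n \\
 & & \mid & \mathtt{ldfld}\ fname \mid \mathtt{stfld}\ fname \mid \mathtt{add} \mid \mathtt{sub} \mid \mathtt{equal} \mid \mathtt{leq} \\
 & & \mid & \mathtt{jump}\ l \mid \mathtt{brtrue}\ l \mid \mathtt{brfalse}\ l \mid \mathtt{newobj}\ cname \mid \mathtt{call}\ mname
\end{array}
\]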
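The conditions on $\Gamma$ in Definition 1 can be read as a recipe for building the context from the declarations of a program. The following Python sketch does that and also checks the conditions on io, main and Main. The checking of method bodies is deliberately left out. The representation is an assumption made for the sketch: signatures are given as dictionaries, a method type is a pair of an argument tuple and a result tuple (so the type $()$ is `((), ())`), and a class is a pair of its field names and its method names. The name `build_gamma` is hypothetical.

```python
def build_gamma(field_sigs, method_sigs, classes):
    """Build the context Gamma and collect violations of Definition 1.

    field_sigs:  {fname: "int" or "object"}          from `field fname : rho`
    method_sigs: {mname: (args, results)}            from `method mname : tau`
    classes:     {cname: (field_names, method_names)}
    Returns (gamma, errors). Typing of method bodies is not covered here.
    """
    errors = []
    gamma = {"fields": dict(field_sigs), "methods": dict(method_sigs), "classes": {}}

    # Gamma(cname) is the tuple of the declared types of the fields of cname.
    for cname, (fnames, _mnames) in classes.items():
        undeclared = [f for f in fnames if f not in field_sigs]
        if undeclared:
            errors.append(f"class {cname}: no signature for field(s) {undeclared}")
        else:
            gamma["classes"][cname] = tuple(field_sigs[f] for f in fnames)

    # The remaining conditions of Definition 1 on io, main and Main.
    if field_sigs.get("io") != "int":
        errors.append("field io must have type int")
    if method_sigs.get("main") != ((), ()):
        errors.append("method main must have type ()")
    if "Main" not in classes:
        errors.append("no class Main")
    else:
        fnames, mnames = classes["Main"]
        if list(fnames) != ["io"]:
            errors.append("class Main must have exactly one field io")
        if list(mnames) != ["main"]:
            errors.append("class Main must have exactly one method main")
    return gamma, errors
```

A full verifier would go on to check every method body against the resulting context, as the last condition of Definition 1 requires.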