The Design of Core C++ (Notes)

Uday Reddy
May 13, 1994

This note defines a small formal language called Core C++, which reflects the essential structure of C++. As the name implies, the design captures only the core language, not the bells and whistles. Moreover, Core C++ deviates from real C++ in certain ways in order to streamline the design and to disambiguate the semantics. Some of these deviations are only formal: they are useful for describing the semantics, but they need not be reflected concretely in the real language.

1 Type Structure of C

The type structure of C seems to be a three-layered system: data types are the types of values that can be stored in variables (denoted by the schematic variable δ), storage types are the types of storage objects (denoted by the schematic variable S), and types in general are denoted by the schematic variable T.

Data types only appear as parts of other types. Storage types and general types, on the other hand, occur in declarations. Typically, the former occur in storage declarations and the latter in parameter declarations.

Data types. The data types are defined by

\[ \delta ::= \texttt{int} \mid T{*} \mid \texttt{void} \]

Here int is representative of the plethora of data types found in C (short, float, etc.). T* stands for pointers to T-typed values. Note that the destination of a pointer can be of any type, not only a data type. Finally, void is a paradoxical type of no value.

Storage types. A good sample of storage types is given by

\[ S ::= \delta\ \texttt{var} \mid S[\,] \mid \texttt{struct}\ \{S_1\ x_1; \ldots\ S_n\ x_n;\} \]

The storage type δ var stands for δ-typed variables, but var is usually not written, i.e., δ var as a storage type is simply written δ. (More on this below.)

The storage type S[ ] stands for arrays of S-typed storage. (This type has no relation to pointers.) struct {S_1 x_1; ... S_n x_n;} stands for structures with components x_1, ..., x_n of storage types S_1, ..., S_n respectively.

Types. General types are given by the syntax

\[ T ::= \delta\ \texttt{val} \mid S \mid S\ \texttt{const} \mid \delta(T_1, \ldots, T_n) \]

δ val is the type of values of type δ. The suffix val is for disambiguation; it is never written in concrete C. Every storage type S is a type. S suffixed with const is the type of constant structures of type S. This is a promise by the programmer never to assign to any component of the storage structure. δ(T_1, ..., T_n) stands for functions that take arguments of types T_1, ..., T_n (for n ≥ 0) and return results of type δ val.

It is significant that the arguments can be of any type, but the result can only be a data value. We are keeping with the C philosophy of being a low-level language [1, pp. 1-2]. In practice, this is not a serious limitation because T* is a data type for any type T. One might wonder whether functions could return results of arbitrary type (instead of only data values). In general, this would involve heap storage with garbage collection. In my opinion, this would be a radical extension of C.

Remarks

1. An important idea in the above treatment is that variables are primitive storage objects and other storage structures are built from them. For example, the storage declaration int a[10]; declares a to be an array consisting of 10 integer variables, not an array variable with an array value of 10 integers. The same remark applies to structs. Storage structures of this kind cannot be passed by value, basically because there is no such thing as the value of an array or a struct. Warning: Real C does not agree with this view for structs, though it agrees for arrays.

2. A function taking a parameter of a storage type takes the entire storage structure. In other words, it corresponds to call by reference. (Footnote 1: This is not strictly C, but C++. In C, function types are of the form δ(δ_1, ..., δ_n), i.e., arguments can only be data values.) Parameters of type δ val, however, are passed by value.

3. This three-layered type system is in contrast to the two-layered system of data and phrase types advocated by Reynolds for Algol-like languages [2]. The reason for the additional layer of storage types is that values of these types have automatic storage creation mechanisms. This is not the case for functions.

4. Note that general types occur in only two places: (i) as parameter types of functions and (ii) as destination types of pointers.

5. The convention of abbreviating both δ var and δ val as simply δ leads, not surprisingly, to an ambiguity. All real-life versions of C use the following convention: in a function-parameter position, δ means δ val and, in a pointer-destination position, δ means δ var. (Note that these are the only two places where general types are used.) In Standard C, the effect of δ var parameters is obtained by the address-and-pointer mechanism. In C++, however, a notion of references is introduced to circumvent the ambiguity (and then arbitrarily generalized). In a pointer-destination position, the effect of δ val is obtained by δ const (which is really δ var const, a rather circumlocutory expression!).

Type compatibility. Type compatibility for struct types is by name and, for others, it is by structure. "By name" means that struct types are given names by unique type definitions of the form

    struct t {S_1 x_1; ... S_n x_n;}

Two struct types are then considered equal if and only if they have the same name.

Type compatibility is not strictly equality. It is governed by a subtype relation. Primitive data types have a standard subtype relation, e.g., int <: float. For pointer types, we have T* <: T'* iff T <: T'. Storage types have a trivial subtype relation, i.e., a storage type is a subtype only of itself. For general types, the subtype relation is given by the following rules:

\[ \frac{\delta <: \delta'}{\delta\ \texttt{val} <: \delta'\ \texttt{val}} \qquad S <: S\ \texttt{const} \]

\[ \frac{\delta <: \delta' \quad T_1' <: T_1 \ \cdots\ T_n' <: T_n}{\delta(T_1, \ldots, T_n) <: \delta'(T_1', \ldots, T_n')} \]

2 Enhancements of C++

C++ extends structs as follows:

\[ S ::= \ldots \mid \texttt{struct} : t\ \{T_1\ x_1; \ldots\ T_n\ x_n;\} \]

This notation extends C structs in two ways. First, the struct type is declared as a subtype of another struct type t. Therefore, one has the subtype relation

\[ \texttt{struct} : t\ \{T_1\ x_1; \ldots; T_n\ x_n;\} <: t \]

Second, the members of the struct are allowed to be of any type (including function types).

The functions in a struct are called member functions or methods. They are deemed to have an implicit parameter t *this, where t is the name of the struct type. As usual, structs are always introduced via definitions. Such definitions are expected to give bindings for all the non-storage members (values, constants and function members). For example, a counter class is defined as

    struct counter {
        int x = 0;
        int inc() { return (++ this->x); }
    };

An equivalent definition is

    struct counter {
        int x = 0;
        let int inc() = counter_inc
    };
    int counter_inc(counter *this) { return (++ this->x); }

3 Declarations

There are two kinds of declarations: binding declarations (also called definitions) declare an identifier and bind it to a specific value; type declarations simply declare the type of an identifier. There are no declarations for the δ val type. (It would be ambiguous, as noted above.) For the other three kinds of types, the two declarations have the syntax

    binding declaration                  type declaration
    S x = P;                             extern S x;
    S const x = P;                       extern S const x;
    δ f(T_1 x_1, ..., T_n x_n) {C}       δ f(T_1, ..., T_n);

The declaration S x = P means: create a storage object of type S with name x and initialize it with the value denoted by P. This kind of definition is a statement as well as a binding. It is a statement in that it must be executed to create storage. It is a binding in that it binds the name x to the newly created storage. The scope of the name is the remainder of the program text, delimited only by braces. For example, in {...; S x = P; ...;} the scope of x is the text following the declaration up to the closing brace. If there are no braces delimiting the scope, the scope consists of the entire following text. In contrast, a function declaration is not a statement. It can only appear at the top level.

Generic let binding. In addition to these, we introduce a generic binding mechanism of the form let T x = P. This is illustrated by the following examples:

    let int val small = 255;                  /* integer constant */
    let int var a = p.x;                      /* alias to p.x */
    let int sqr(int) = {(int x) return x*x;}  /* function by a block expression */
    let type matrix = float[][]               /* type definition */

Member declarations. In a struct definition, the declarations of member fields have the status of binding declarations. So, they can have initializers, function definitions and let bindings. In addition, the definition of a derived struct (in C++) can redefine the members of the parent.
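To relate these bindings to present-day C++, the following is a minimal sketch of rough analogues; it is not part of Core C++ itself. It assumes C++11 or later. The struct point, the functions let_demo and counter_demo, and the parameter name self are hypothetical names introduced only for illustration, and the lambda is merely a stand-in for a block expression.

```cpp
#include <cassert>

// Hypothetical struct standing in for the `p` of the alias example.
struct point { int x = 0; int y = 0; };

// let int val small = 255;  -- a name bound to a data value, not to storage.
constexpr int small = 255;

// let int sqr(int) = {(int x) return x*x;}  -- a function bound to a name.
// A capture-less lambda plays the role of the block expression (an assumption).
auto sqr = [](int x) { return x * x; };

// The counter struct, and its desugared form with an explicit `this` parameter.
// The parameter is called `self` because `this` is reserved in real C++.
struct counter {
    int x = 0;
    int inc() { return ++this->x; }
};
int counter_inc(counter *self) { return ++self->x; }

int let_demo() {
    point p;
    int &a = p.x;          // let int var a = p.x;  -- a is an alias, no new storage
    assert(&a == &p.x);    // both names denote the same variable
    a = small;             // assigning through the alias updates p.x
    return p.x + sqr(2);   // 255 + 4
}

bool counter_demo() {
    counter c, d;
    // The member function and the desugared function behave alike.
    return c.inc() == counter_inc(&d) && c.x == d.x;
}
```

Calling let_demo() yields 259, and counter_demo() yields true. The reference declaration is the closest real-C++ counterpart of a let binding at type int var, which is one reason the note treats references as a special case of a general binding mechanism.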

4 Module system

Even though C++ does not have a module system, we define one because it solves many problems which are otherwise treated in an arbitrary fashion in C++. Specifically, we want a module system that integrates header files into the language, provides representation hiding for types as well as for classes, and provides parametric polymorphism.

Our module system is modeled directly on that of Standard ML. A module is a named collection of binding declarations. This is similar to a struct definition except that it may also have type definitions. Modules in turn have "large types" that are called signatures.

Examples. A signature for items in a binary search tree may be defined as follows:

    signature CMP {
        type T;
        bool operator==(T, T);
        bool operator<(T, T);
    }

Then a specific module of this signature might be:

    CMP module StringCmp {
        let type T = char[];
        bool operator==(T s, T t) { return (strcmp(s,t) == 0); }
        bool operator<(T s, T t) { return (strcmp(s,t) < 0); }
    }

A simpler signature, sufficient for searching in a linked list, and a module of that signature can be defined as follows:

    signature EQ {
        type T;
        bool operator==(T, T);
    }
    let EQ module StringEq = StringCmp;

The signature of a searching module can be defined thus:

    signature TABLE {
        type item;
        struct tableops {
            bool member(item);
            void insert(item);
        };
        struct table <: tableops;
    }

The struct table has been declared to be some unknown subtype of tableops. Here are two sample searching modules:

    TABLE module SlowTable(EQ module Elem) {
        let type item = Elem.T;
        import List(Elem);  /* imports list, member, cons etc. */
        struct table: tableops {
            list elems;
            bool member(item x) { return member(x, this->elems); }
            void insert(item x) { this->elems = cons(x, this->elems); }
        };
    }

    TABLE module SearchTree(CMP module Elem) {
        let type item = Elem.T;
        struct table: tableops {
            item root;
            table *left, *right;
            bool member(item x) {...}
            void insert(item x) {...}
        };
    }

Notice that the data members of the table struct have been made private by declaring the module to have signature TABLE. Moreover, we achieve polymorphism by module parameterization. Thus, the use of parameterized modules is a powerful mechanism that solves several problems at once.

Signatures. A signature named Σ is defined in the following fashion:

    signature Σ {D_1; ... D_n;}

where each D_i is a declaration. It may be one of the following:

    extern T x                 type declaration of a name
    type t                     declaration of an opaque type
    struct t <: t'             declaration of an opaque subtype
    type t = T                 definition of a type
    struct t {S_1 x_1; ...}    definition of a struct type
    Σ module M                 declaration of a module
    include Σ                  inclusion of another signature

Type compatibility for signatures is by structure, not by name. For example, CMP <: EQ above.

Modules. The definition of a module has the syntax:

    Σ module M {B_1; ...; B_n;}

where each B_i is a binding declaration. In addition to the binding declarations already mentioned, modules provide a new one:

    import M

Its effect is to include in the current context all the bindings of the module M as constrained by its signature.
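Parameterized modules of this kind can be approximated in real C++ with class templates. The following is a minimal sketch under stated assumptions, not a rendering of the module system itself: a struct with a type member and static functions stands in for a module, named functions eq and lt stand in for the operators (real C++ does not allow overloading operators on two pointers), std::vector stands in for the List module, and slow_table_demo is a hypothetical driver.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Stands in for `CMP module StringCmp`: a type T plus comparison operations.
struct StringCmp {
    using T = const char *;
    static bool eq(T s, T t) { return std::strcmp(s, t) == 0; }
    static bool lt(T s, T t) { return std::strcmp(s, t) < 0; }
};

// Stands in for the `tableops` struct of signature TABLE.
template <typename Item>
struct tableops {
    virtual bool member(Item) const = 0;
    virtual void insert(Item) = 0;
    virtual ~tableops() = default;
};

// Stands in for `TABLE module SlowTable(EQ module Elem)`.
// Only Elem::T and Elem::eq are used, so any argument that offers at least
// those two names is accepted. That mirrors the structural rule CMP <: EQ.
template <typename Elem>
class SlowTable : public tableops<typename Elem::T> {
public:
    using item = typename Elem::T;
    bool member(item x) const override {
        for (item e : elems)
            if (Elem::eq(x, e)) return true;
        return false;
    }
    void insert(item x) override { elems.push_back(x); }
private:
    std::vector<item> elems;  // representation hidden from clients
};

bool slow_table_demo() {
    SlowTable<StringCmp> t;
    assert(!t.member("core"));   // empty table has no members
    t.insert("core");
    char probe[] = "core";       // distinct storage, equal contents
    return t.member(probe) && !t.member("c++");
}
```

Unlike a signature, a template does not state its requirements on Elem anywhere; they are discovered only when the template is instantiated. This is one of the gaps that an explicit signature such as EQ is meant to close.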

5 Phrases

Phrases are terms that one writes in the language. These are not to be confused with expressions (arithmetic or other), which yield data values. A phrase always has a phrase type. Phrase types include all types T mentioned in Sec. 1 and, in addition, two special types: S init for initializers of S-typed storage objects, and δ stmt for statements which might return δ-typed values. The phrase type stmt is used for statements which do not return. They may be deemed to be δ stmts for any δ. These types are special in that they are only used for describing terms. They are not included in regular types. So, there are no parameters, identifiers or member fields of these special types.

A phrase often has free identifiers, each of which is of a specific type. Let x_1 : T_1, ..., x_n : T_n be a list of such distinct identifiers (with order assumed insignificant). To say that P is a phrase of phrase type τ in such a context, we write

\[ x_1 : T_1, \ldots, x_n : T_n \vdash P : \tau \]

Greek letters such as Γ are used to stand for lists of free identifiers. We write Γ[x : T] to mean the context Γ with the entry for x (if any) replaced by x : T.

5.1 General phrases

An identifier of type T is always a phrase of type T:

\[ \Gamma \vdash x : T \qquad \text{if } x : T \text{ is in } \Gamma \]

The type of a phrase is convertible by subtyping:

\[ \frac{\Gamma \vdash P : T}{\Gamma \vdash P : T'} \qquad \text{if } T <: T' \]

5.2 Expressions

A phrase of type δ val is called an expression. Note that expressions can only denote data values.

int. Examples:

\[ \Gamma \vdash 0 : \texttt{int val} \qquad \frac{\Gamma \vdash E_1 : \texttt{int val} \quad \Gamma \vdash E_2 : \texttt{int val}}{\Gamma \vdash E_1 + E_2 : \texttt{int val}} \]

pointers. A pointer is created by new and used by the dereferencing operator:

\[ \Gamma \vdash \texttt{NULL} : T{*}\ \texttt{val} \qquad \frac{\Gamma \vdash P : T\ \texttt{init}}{\Gamma \vdash \texttt{new}\ T\ P : T{*}\ \texttt{val}} \]

\[ \frac{\Gamma \vdash E : T{*}\ \texttt{val}}{\Gamma \vdash {*}E : T} \qquad \frac{\Gamma \vdash E : T{*}\ \texttt{val}}{\Gamma \vdash \texttt{delete}\ E : \texttt{stmt}} \]

The initializer for a new function must always be a top-level function name (to avoid dangling references). Note that we do not support the address operator &, since it would lead to dangling references.
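The pointer rules can be read against ordinary C++ as follows. This is a minimal sketch, assuming C++11; make_cell and pointer_demo are hypothetical names, and the commented-out function is shown only to illustrate the kind of dangling reference that the omission of & rules out.

```cpp
#include <cassert>

// new T P : T* val, where P initializes the freshly created storage.
int *make_cell(int init) { return new int(init); }

// With the address operator, a pointer can outlive the storage it names.
// Core C++ cannot express this function because it has no `&`:
//
//     int *dangling() { int local = 0; return &local; }

int pointer_demo() {
    int *p = make_cell(41);  // new int 41 : int* val
    assert(p != nullptr);    // NULL is also an int* val, but not this one
    *p = *p + 1;             // *p : int var, read as a value and assigned to
    int result = *p;
    delete p;                // delete E : stmt
    return result;
}
```

Here pointer_demo() returns 42. Every pointer in the sketch originates from new, so its destination lives until the matching delete. Note that delete can still leave stale pointers behind, so omitting & removes one source of dangling references, not all of them.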

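5.3 Statements

We use braces { } as parentheses for grouping statements. They can be dropped wherever unnecessary. The three basic operations are empty statement, sequencing and return:

\[ \Gamma \vdash \{\,\} : \texttt{stmt} \qquad \Gamma \vdash \texttt{return} : \texttt{void stmt} \]

\[ \frac{\Gamma \vdash C_1 : \delta\ \texttt{stmt} \quad \Gamma \vdash C_2 : \delta\ \texttt{stmt}}{\Gamma \vdash \{C_1; C_2\} : \delta\ \texttt{stmt}} \qquad \frac{\Gamma \vdash E : \delta\ \texttt{val}}{\Gamma \vdash \texttt{return}\ E : \delta\ \texttt{stmt}} \]

A non-returning statement can be deemed to return a result of any type:

\[ \frac{\Gamma \vdash C : \texttt{stmt}}{\Gamma \vdash C : \delta\ \texttt{stmt}} \]

The infamous expression statement is in Core C++:

\[ \frac{\Gamma \vdash E : \delta\ \texttt{val}}{\Gamma \vdash E : \texttt{stmt}} \]

5.4 Objects

A storage object is created by a binding declaration:

\[ \frac{\Gamma \vdash P : S\ \texttt{init} \quad \Gamma[x : S] \vdash C : \delta\ \texttt{stmt}}{\Gamma \vdash \{S\ x = P;\ C\} : \delta\ \texttt{stmt}} \]

\[ \frac{\Gamma \vdash P : S\ \texttt{init} \quad \Gamma[x : S\ \texttt{const}] \vdash C : \delta\ \texttt{stmt}}{\Gamma \vdash \{S\ \texttt{const}\ x = P;\ C\} : \delta\ \texttt{stmt}} \]

Variables. Variables can be dereferenced and assigned:

\[ \frac{\Gamma \vdash X : \delta\ \texttt{var}}{\Gamma \vdash X : \delta\ \texttt{val}} \qquad \frac{\Gamma \vdash X : \delta\ \texttt{var} \quad \Gamma \vdash E : \delta\ \texttt{val}}{\Gamma \vdash X = E : \delta\ \texttt{val}} \qquad \frac{\Gamma \vdash X : \delta\ \texttt{var const}}{\Gamma \vdash X : \delta\ \texttt{val}} \]

Arrays. Arrays are subscripted:

\[ \frac{\Gamma \vdash X : S[\,] \quad \Gamma \vdash E : \texttt{int val}}{\Gamma \vdash X[E] : S} \qquad \frac{\Gamma \vdash X : S[\,]\ \texttt{const} \quad \Gamma \vdash E : \texttt{int val}}{\Gamma \vdash X[E] : S\ \texttt{const}} \]

Structures. Structures have field selection:

\[ \frac{\Gamma \vdash X : \texttt{struct}\{\ldots; T\ x; \ldots\}}{\Gamma \vdash X.x : T} \qquad \frac{\Gamma \vdash X : \texttt{struct}\{\ldots; T\ x; \ldots\}\ \texttt{const}}{\Gamma \vdash X.x : T\ \texttt{const}} \]

Initializers. Variable initializers are just expressions:

\[ \frac{\Gamma \vdash E : \delta\ \texttt{val}}{\Gamma \vdash E : \delta\ \texttt{var init}} \]

Array and structure initializers are suitable collections. They are omitted in this summary.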
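The way const propagates through subscripting and field selection in the rules above can be observed in real C++ at compile time. The following is a minimal sketch, assuming C++11; pair_t and objects_demo are hypothetical names, and decltype((e)) is used only as a way of asking the compiler for the type of the lvalue e.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical struct with two variable components.
struct pair_t { int first; int second; };

int objects_demo() {
    int a[3] = {1, 2, 3};        // S x = P with S = int[ ]
    const int c[3] = {4, 5, 6};  // S const x = P
    pair_t p = {7, 8};
    const pair_t q = {9, 10};

    // X : S[ ]        gives  X[E] : S
    static_assert(std::is_same<decltype((a[0])), int &>::value, "X[E] : S");
    // X : S[ ] const  gives  X[E] : S const
    static_assert(std::is_same<decltype((c[0])), const int &>::value, "X[E] : S const");
    // X : struct{...}        gives  X.x : T
    static_assert(std::is_same<decltype((p.first)), int &>::value, "X.x : T");
    // X : struct{...} const  gives  X.x : T const
    static_assert(std::is_same<decltype((q.first)), const int &>::value, "X.x : T const");

    a[0] = c[1];         // left side is an int var; right side is read as an int val
    p.second = q.first;  // a const variable can still be dereferenced
    assert(a[0] == 5);
    return a[0] + p.second;  // 5 + 9
}
```

Here objects_demo() returns 14. Writing c[0] = 1 or q.first = 1 instead would be rejected by the compiler, which corresponds to the absence of an assignment rule for δ var const.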

5.5 Functions

A function is applied in the usual fashion:

\[ \frac{\Gamma \vdash P : \delta(T_1, \ldots, T_n) \quad \Gamma \vdash Q_1 : T_1 \ \cdots\ \Gamma \vdash Q_n : T_n}{\Gamma \vdash P(Q_1, \ldots, Q_n) : \delta\ \texttt{val}} \]

A function is built by a block expression:

\[ \frac{\Gamma[x_1 : T_1, \ldots, x_n : T_n] \vdash C : \delta\ \texttt{stmt}}{\Gamma \vdash \{(T_1\ x_1, \ldots, T_n\ x_n)\ C\} : \delta(T_1, \ldots, T_n)} \]

Note that block expressions can be used to create downward closures.

6 Conclusion

I hope these brief design notes have convinced the reader that the core of C and C++ is quite solid. The essential concepts can be repackaged in a coherent fashion. Let me mention some open issues which I have not touched upon:

1. A formal semantics of the language must be defined and a coherence theorem must be proved, stating that every type derivation gives the same meaning. For C++ this theorem is alleged not to be true.

2. All member functions are automatically virtual. Is this implementable efficiently? What about multiple inheritance?

3. We have ignored the issue of constructors and destructors. Do we need them? Can they be added cleanly?

4. Should functions return larger values? Other than garbage collection, what problems are there? How do expressions and phrases get reconciled?

References

[1] B. W. Kernighan and D. M. Ritchie. The C Programming Language, Second Edition. Prentice Hall, 1988.

[2] J. C. Reynolds. The essence of Algol. In J. W. de Bakker and J. C. van Vliet, editors, Algorithmic Languages, pages 345-372. North-Holland, 1981.