The Logical Design of the Tokeniser

Purpose

1. To split up a character string holding a RAQUEL statement expressed in linear text into a sequence of character strings (called word tokens), each of which holds a word of RAQUEL text extracted from the statement. This simplifies the next stage of tokenisation by removing the whitespace that separates words and providing that stage with individual words.

2. To check for unbalanced curly and round brackets, thereby preventing unnecessary further work if the brackets are unbalanced, and to make an equivalent check for square brackets, bearing in mind that the syntax allows an additional square bracket to denote special asymmetric versions of algebraic operators.

RAQUEL Word Tokens

A RAQUEL word token is defined as a character string denoting any one of the following :

- a RAQUEL keyword, which is the name of any of :
    - an assignment,
    - an algebra operator,
    - the parameter prefix of an assignment or operator (e.g. With),
    - a special symbol, either @ or # ;
- an assignment or operator parameter (including its enclosing square brackets);
- an additional square bracket to denote a special version of an algebraic operator;
- a left or right parenthesis (round bracket);
- the name of a container variable;
- a literal container value, be it a set of tuples (i.e. a relational value), a bag of tuples or a sequence of tuples (including its enclosing curly brackets);
- a default tuple expressed as a 1-tuple literal container value (including its enclosing curly brackets);
- a literal scalar value.

Each instance of every one of the above generates a RAQUEL word token. There are also two special statement delimiter RAQUEL word tokens (see below).

Note that no RAQUEL word token can contain any whitespace characters, apart from any arising within the enclosing brackets of a parameter or literal value.

Input

A RAQUEL statement.

Output

1. A sequence of RAQUEL word tokens as defined above.
The order of the tokens reflects the order in which they appeared in the original statement. The sequence always begins with the RAQUEL word token tab,tab and always ends with a closing delimiter word token; the two are used as special delimiters so that it is always possible to check whether any of the output has been lost.

2. A Success list. If it is empty, the tokenising was successful. If it contains one or more positive integers, each integer indicates a particular type of error that was detected; the order of the numbers indicates the sequence in which the errors were found.

Syntax Rules

There are two situations where explicit syntax rules are needed to distinguish between two otherwise ambiguous expressions :

1. Asymmetric join operators indicate whether they are left or right joins by having an additional square bracket, either a [ or a ], attached to the left or right hand end, respectively, of their parameter. Ideally whitespace could appear between the additional bracket and the parameter bracket, to permit the user to format the statement as they wish. However the generalised join has a truth-valued expression as its parameter, and this could include operators and Selector literal values which themselves have parameters surrounded by [ and ]. Thus a situation could arise where the right hand end of the generalised join parameter could not be reliably recognised, due to its terminating in a sequence of right hand square brackets. To avoid this, the syntax constrains the user to leave no whitespace between the two right hand terminating brackets of a right hand join, and to leave at least one whitespace character between the right hand brackets that arise from nesting. For consistency this is applied to both generalised and natural joins, and also to the two left hand terminating brackets of both left hand joins.

2. Ideally there would always be whitespace left between the operands of arithmetic operators and the operator itself.
However, in practice many users would expect to be able to input valid arithmetic expressions without including such whitespace, e.g. 3-4. This causes a problem with the - character, since in addition to representing the subtract operator it is also used as a numeric prefix to indicate that a number has a negative value. This raises the question of whether 3-4 indicates a subtraction expression or a positive number followed by a negative number. To conform with common usage, the syntax constrains the user always to have at least one whitespace character immediately before a negative number's - prefix, in order to distinguish this use of - from its use as the subtract operator. Thus 3- -4 would be accepted as meaning subtract -4 from 3, whereas 3--4 would result in the Tokeniser considering -- to be a word on its own, ultimately resulting in an error. The same considerations apply if the subtract operator is replaced by another arithmetic operator character. Note that this means that a statement commencing with a negative number must actually start with at least one whitespace character before the negative number.
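The negative-prefix rule can be illustrated with a minimal Python sketch. It is not the Tokeniser itself: it handles only digits, standalone characters and whitespace, and the function name `split_arithmetic` is hypothetical. The flags `al` and `prefix` mirror the AL and Prefix flags described later in this document.

```python
# Assumed standalone set, taken from the Standalone character list.
STANDALONE = set("+-*/^=~<>@#;:,")

def split_arithmetic(text):
    """Split an arithmetic fragment into word tokens (illustrative sketch)."""
    tokens = []      # completed word tokens
    current = ""     # word token under construction
    al = False       # True while `current` is a standalone word token
    prefix = False   # True when the last '-' may be a negative-number prefix
    prev = None      # previous character, i.e. the previous event
    for ch in text:
        if ch in STANDALONE:
            if al:
                current += ch            # extend the standalone token
                prefix = False
            else:
                if current:
                    tokens.append(current)
                current = ch             # start a new standalone token
                al = True
                prefix = ch == "-" and prev is not None and prev.isspace()
        elif ch.isdigit():
            if prev == "-" and prefix:
                current += ch            # the '-' becomes the number's sign
                prefix = False
                al = False
            elif al:
                tokens.append(current)   # complete the standalone token
                current = ch             # start a numeric token
                al = False
            else:
                current += ch
        elif ch.isspace():
            if current:
                tokens.append(current)
            current = ""
            al = False
        else:
            raise ValueError(f"character {ch!r} is outside this sketch")
        prev = ch
    if current:
        tokens.append(current)
    return tokens
```

Under these assumptions, `split_arithmetic("3- -4")` yields the tokens 3, - and -4, while `split_arithmetic("3--4")` yields 3, -- and 4, leaving -- as a word on its own for a later stage to reject.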

Delimiters

These are :

whitespace : one or more of space, tab, newline and/or carriage return characters;
brackets : ( ) { } [ ]

Alphanumerics

These are :

Text : a .. z  A .. Z
Text Delimiter :
Numeric : 0 .. 9
Standalone : + - * / ^ = ~ < > @ # ; : ,
    These characters are meaningful (or potentially meaningful) on their own and not as part of words. They are also characters that users may not always delimit by putting whitespace characters round them, and so have to be specially picked out by the tokeniser.
Miscellaneous : ! $ % & _ \ ?
    These comprise all the remaining characters on a current standard British keyboard that are not specified above.

Variable Names

These can be composed from any of the text, miscellaneous and numeric characters, as long as the first character is a text character.
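As an illustration of these character classes, the following Python sketch classifies a single character and checks the variable-name rule. The names `classify` and `is_variable_name` are hypothetical, and the text delimiter character is deliberately left out because it is not shown above; reading the end of the Standalone list as a colon followed by a comma is also an assumption.

```python
import string

WHITESPACE = set(" \t\n\r")
BRACKETS = set("(){}[]")
TEXT = set(string.ascii_letters)        # a..z, A..Z
NUMERIC = set(string.digits)            # 0..9
STANDALONE = set("+-*/^=~<>@#;:,")      # assumption: list ends with ':' and ','
MISCELLANEOUS = set("!$%&_\\?")

def classify(ch):
    """Return the name of the character class that `ch` belongs to."""
    for name, members in (
        ("whitespace", WHITESPACE),
        ("bracket", BRACKETS),
        ("text", TEXT),
        ("numeric", NUMERIC),
        ("standalone", STANDALONE),
        ("miscellaneous", MISCELLANEOUS),
    ):
        if ch in members:
            return name
    return "other"   # includes the text delimiter, which this sketch omits

def is_variable_name(word):
    """First character must be text; the rest text, miscellaneous or numeric."""
    allowed = TEXT | MISCELLANEOUS | NUMERIC
    return bool(word) and word[0] in TEXT and all(c in allowed for c in word)
```

For example, `is_variable_name("Attr_1")` holds, whereas `is_variable_name("1Attr")` does not, because the first character is numeric.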

Example One

Input : Character string

    RELATION <--Delete @ Restrict[ Attr > 3 ]

Output : RAQUEL word token sequence (one word token per line) :

    tab,tab
    RELATION
    <--
    Delete
    @
    Restrict
    [ Attr > 3 ]

Success = empty list

Example Two

Input : Character string

    EMP Restrict[ Attr1 Member Attr2 ] Project[ DName ]

Output : RAQUEL word token sequence (one word token per line) :

    tab,tab
    EMP
    Restrict
    [ Attr1 Member Attr2 ]
    Project
    [ DName ]

Success = empty list

Example Three

Input : Character string

    PART Join{ NoPurchase, N/A }[[ SNo ]]{ Factory, 0 } SUPPLIER Restrict[ Wt > 50 ]

Output : RAQUEL word token sequence (one word token per line) :

    tab,tab
    PART
    Join
    { NoPurchase, N/A }
    [
    [ SNo ]
    ]
    { Factory, 0 }
    SUPPLIER
    Restrict
    [ Wt > 50 ]

Success = empty list

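Pre-Conditions

Input Variable

Input? : character string

Condition

\[ 0 \le \mathrm{length}(\text{Input?}) \le \text{max} \]

(where max = maximum length of character string that can be passed into the Tokeniser).

Post-Conditions

Output Variables

Output! : sequence( word tokens )
Success! : sequence(n)

Conditions

\[
\begin{aligned}
\mathrm{length}(\text{Success!}) = 0 \;\Rightarrow\;
  & \mathrm{length}(\text{Input?}) > 0 \\
  & \land\; 2 < \mathrm{length}(\text{Output!}) \\
  & \land\; \text{Output!}[1] = \text{tab,tab} \\
  & \land\; \text{Output!}[\mathrm{length}(\text{Output!})] = \langle\text{closing delimiter}\rangle \\
  & \land\; \forall\, w \in \text{Output!}[2 \,..\, (\mathrm{length}(\text{Output!}) - 1)] \;\bullet\; w = \mathrm{subsequence}(\text{Input?}) \\
  & \land\; \forall\, w_1, w_2 \in \text{Output!}[2 \,..\, (\mathrm{length}(\text{Output!}) - 1)] \;\bullet\; \\
  & \qquad w_1 = \text{Output!}(n) \land w_2 = \text{Output!}(n+1) \;\Rightarrow\;
    \langle w_1 \ \text{catenate}\ w_2 \rangle = \mathrm{subsequence}(\text{Input?}\ \text{Difference}\ \text{whitespace})
\end{aligned}
\]

\[ \mathrm{length}(\text{Success!}) > 0 \;\Rightarrow\; \mathrm{length}(\text{Output!}) = 0 \]

\[ \mathrm{length}(\text{Input?}) = 0 \;\Rightarrow\; \mathrm{length}(\text{Success!}) = 1 \;\land\; \mathrm{length}(\text{Output!}) = 0 \]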
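A rough Python sketch of how these post-conditions might be checked against a tokeniser result follows. The function name `satisfies_postconditions` and the placeholder `"<end>"` for the closing delimiter are hypothetical. The sketch also assumes that the pairwise condition is meant to be applied after whitespace has been stripped from the tokens as well as from the input, since parameter and literal tokens may contain whitespace.

```python
def strip_ws(s):
    """Remove every whitespace character from s."""
    return "".join(s.split())

def satisfies_postconditions(input_text, output, success,
                             start="tab,tab", end="<end>"):
    """Check a tokeniser result against the three post-conditions."""
    if len(input_text) == 0:
        # Empty input: exactly one error and no output.
        return len(success) == 1 and len(output) == 0
    if len(success) > 0:
        # Any error: the output must be empty.
        return len(output) == 0
    inner = output[1:-1]                 # tokens between the two delimiters
    squeezed = strip_ws(input_text)
    return (
        len(output) > 2
        and output[0] == start
        and output[-1] == end
        # every word token is a contiguous piece of the input
        and all(w in input_text for w in inner)
        # adjacent tokens are adjacent in the input once whitespace is ignored
        and all(strip_ws(a + b) in squeezed for a, b in zip(inner, inner[1:]))
    )
```

Applied to Example One, the token list passes the check, while the same tokens in a different order fail the adjacency condition.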

Tokeniser States

There are three main states the tokeniser can be in :

1. The Standard state. In this state, the tokeniser is proceeding through the assignment and/or algebra part of a statement.

2. The Parameter state. In this state, the tokeniser is proceeding through the parameter part of an assignment and/or algebra operator.

3. The Literal state. In this state, the tokeniser is proceeding through a literal relation or a default tuple.

The tokeniser proceeds through a complete statement from left to right. Since in RAQUEL every valid parameter must be preceded by a RAQUEL keyword to its left, the initial state in tokenising is the Standard state.

Valid text in the Standard state is always determined solely by the syntax of linear RAQUEL.

Valid text inside a parameter is determined not only by the syntax of RAQUEL, but also by the syntax of the operators and values of the different domains that can appear within parameters; in addition, RAQUEL statements can appear recursively within parameters to any depth. Because of the consequent potential complexity of a parameter, the tokeniser makes no attempt to break it up into words, as it does with text in the Standard state. Instead it simply produces a complete parameter as a single RAQUEL word token that includes the initial and final square brackets.

Valid text in a literal container or a default tuple (known for short as a literal) is likewise determined by the syntax of RAQUEL operators, and also by the syntax of the operators and values of the different domains that can appear within literals; nested literal containers can also appear there; finally, RAQUEL expressions can appear within them to retrieve values that are to become part of the literal. Therefore, for reasons analogous to those for parameters, the tokeniser produces a literal as a single RAQUEL word token, including the initial and final curly brackets.

It is convenient for the Standard state to have several substates associated with it. One such substate is the Keyword state.
This is entered when the Standard state finds the beginning of a keyword. The Keyword state continues until the tokeniser finds the end of the keyword; then the Keyword state ceases and the tokeniser returns to the Standard state. An exception to this is if the Standard or Keyword states find a [ or {. As this indicates a parameter or a literal respectively, they change the state to the Parameter or Literal state.

Keywords are normally words and hence delimited by whitespace characters and/or brackets of various kinds. However some keywords are actually single characters, e.g. +, =. If such characters were always delimited by spaces and/or brackets, the same procedure for handling words would suffice to handle them as well. However common usage does not always delimit them in this way; e.g. x>3 may be written instead of x > 3, and users would expect the former to be as acceptable as the latter. Hence these so-called Standalone characters are handled differently, each one being put into its own word token whether or not it is delimited.

Standalone characters may also be combined into sequences of 2 or 3 characters which still form standalone keywords in that they are not always delimited; e.g. >= and the symbols prepended to assignments (which distinguish assignments from operators). So where 2 or more standalone characters are found sequentially, they are combined into one word token. However if any character, be it whitespace or anything else, interposes itself between 2 standalone characters, the standalone characters are not combined into one word token but are each put into their own word token.

Other substates of the Standard state are required to handle values of primitive data types. A primitive data type is defined as a type whose permissible values are expressed via an innate representation. An innate representation is so called because values of that type are innately recognised by their representation. By contrast, values of a non-primitive type are recognised by being explicitly labelled as being of a certain data type and not by means of their representation. Each innate representation requires the DBMS to possess built-in means of recognising such values. Such means are not required for non-primitive types; standard plug-ins to the DBMS suffice to provide it with this ability.

There are two primitive types which must be handled, and both affect the tokeniser in order that the appropriate contents are put into word tokens. They are the numeric and textual types. Consequently there are Number and Text substates to handle them. In principle other kinds of primitive data type could be required in future, but no such types are currently considered.

The correct enclosure of expressions within parentheses is checked for within the Standard state; the checking is omitted within the Parameter and Literal states, where parentheses are treated merely as another character within a parameter or literal word token [1].

The Parameter state also has the associated substates Enter?, Param[? and Exit?. They arise from the complications caused by the fact that the RAQUEL syntax allows a second square bracket to appear around a parameter (either symmetrically or asymmetrically) to indicate a special version of the associated operator. To represent this, the additional square brackets become tokens in their own right. Consequently [[ does not signify two levels of enclosure/nesting by square brackets, only one. Likewise ]] signifies only one level of disclosure/un-nesting. In both cases, whitespace cannot appear between the two brackets; otherwise the two brackets signify two levels of en/disclosure.

Thus entry to the Parameter state is via the Enter? substate in order to check for [[. Enter? also checks to see if three or more consecutive [ have occurred, and if so generates a suitable error message.

[1] Where RAQUEL parameters themselves consist of RAQUEL statements, the tokeniser, compactor and parser are together called recursively to handle them. Thus parentheses and curly brackets within such parameters are checked by the recursive calling. Likewise, parentheses and square brackets within literals are checked by the same recursive process.

In order to be certain when a parameter ends, it is necessary to follow the levels of nesting and un-nesting caused by square brackets within the parameter; unbalanced square brackets within a parameter make it impossible to know where the parameter ends. So within a parameter, it is necessary to treat [[ as one bracket that adds one level of enclosure; this is the purpose of the Param[? substate.

The Exit? substate deals with all occurrences of ]]. Thus Exit? has to deal with the two questions whose corresponding inverses are dealt with by Enter? and Param[? :

- the level of disclosure implied by ] or ]], in order to see if the parameter has ended;
- the semantic implication of ]] that a special version of the operator was written.

In fact, once Exit? has checked that ]] terminates a parameter correctly, the tokeniser returns to the Standard state; if a further consecutive ] occurs before any text, then the error is suitably dealt with by the Standard state.

The Literal state has to deal with a similar situation to the Parameter state, except that it is much simpler by virtue of the fact that, unlike square brackets, double curly brackets do not arise as valid text. So the nesting of curly brackets within each other conforms to the normal nesting of brackets within each other. The end of a Literal state is simply indicated by there being no longer any text enclosed within curly brackets.
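The treatment of [[ and ]] as a single level of nesting can be sketched in Python as follows. This is a simplified illustration rather than the state machine specified later: the function name `scan_parameter` is hypothetical, error handling is reduced to raising `ValueError`, and the input is assumed to start at the opening [ of a parameter.

```python
def scan_parameter(text):
    """Return (word tokens, remaining text) for a parameter starting at text[0]."""
    if not text.startswith("["):
        raise ValueError("a parameter must start with [")
    if text.startswith("[[["):
        raise ValueError("too many successive [")
    tokens = []
    i = 0
    if text.startswith("[["):
        tokens.append("[")       # additional bracket becomes its own token
        i = 1
    start = i                    # where the parameter word token begins
    depth = 0                    # plays the role of the counter S
    while i < len(text):
        ch = text[i]
        if ch == "[":
            if text.startswith("[[[", i):
                raise ValueError("too many successive [")
            # [[ with no whitespace adds only one level of enclosure
            i += 2 if text.startswith("[[", i) else 1
            depth += 1
        elif ch == "]":
            depth -= 1
            i += 1
            if depth == 0:
                tokens.append(text[start:i])
                if text.startswith("]]", i):
                    raise ValueError("too many successive ]")
                if text.startswith("]", i):
                    tokens.append("]")   # additional closing bracket
                    i += 1
                return tokens, text[i:]
            if text.startswith("]", i):
                i += 1           # nested ]] removes only one level
        else:
            i += 1
    raise ValueError("incomplete parameter")
```

With these assumptions, `scan_parameter("[[ SNo ]]{ Factory, 0 }")` produces the three word tokens [, [ SNo ] and ], as in Example Three, and leaves the literal for the caller to handle.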

Variables and Their Usage

C is used to count the depth of nesting of curly brackets, { and }.

P is used to count the depth of nesting of parentheses, ( and ).

S is used to count the depth of nesting of square brackets, [ and ], ignoring any additional brackets used to represent special variants of operators.

Thus C, P and S must be set to zero before starting the tokenising. After tokenising, the following checks are carried out :

    IF C > 0 THEN generate unbalanced { error.
    IF P > 0 THEN generate unbalanced ( error.
    IF S > 0 THEN generate unbalanced [ error.
    IF C < 0 THEN generate unbalanced } error.
    IF P < 0 THEN generate unbalanced ) error.
    IF S < 0 THEN generate unbalanced ] error.

RL and RR are used to count successive repetitions of left and right square brackets respectively, in the process of checking for the error of too many [ or ] when handling special versions of an operator.

AL is a truth-valued flag that records whether the last character to be put into a word token was a Standalone character or not. It is set to false initially. Its purpose is to support the algorithm that creates Standalone word tokens. When a standalone character starts a new word token, AL is set to true. If another standalone character immediately follows it, that character is appended to the (standalone) word token and AL is left unaltered. If a non-standalone character immediately follows any standalone character, the Standalone word token is completed, the non-standalone character is put into a new word token (if appropriate), and AL is set to false. This affects the Standard, Number and Keyword states. After creating a Standalone word token, the tokeniser always remains in or reverts to the Standard state.

Prefix is a truth-valued flag that records whether a - sign could prefix a number to indicate that the number is negative, or whether it represents the subtract operator. It is set to false initially and only set to true when a - sign is preceded by a whitespace character, since this indicates that it could be the negative prefix of a number; it is reset to false after use.
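The end-of-tokenising checks on C, P and S can be written down directly. The Python sketch below is illustrative only; the function name `final_bracket_errors` and the use of strings for errors (the design itself records positive integers in the Success list) are assumptions made for readability.

```python
def final_bracket_errors(c, p, s):
    """Return the unbalanced-bracket errors implied by the final counters."""
    errors = []
    # A positive counter means an opening bracket was never closed.
    for count, opener in ((c, "{"), (p, "("), (s, "[")):
        if count > 0:
            errors.append("unbalanced " + opener)
    # A negative counter means a closing bracket had no matching opener.
    for count, closer in ((c, "}"), (p, ")"), (s, "]")):
        if count < 0:
            errors.append("unbalanced " + closer)
    return errors
```

For instance, final counters of C = 1, P = 0 and S = -1 would report an unbalanced { followed by an unbalanced ].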

State Transition Diagram of the Tokeniser

[Figure : state transition diagram. The states named in the diagram include Literal, Number, Text, Keyword, Enter?, Param[? and Exit?; the transitions are labelled with the events whitespace, standalone, numeric and the bracket characters ( ) { } [ ].]

Detailed Specification of the Tokeniser

This specifies, by tokeniser state, the action carried out for each kind of character input (termed the event) and any state change that follows as a consequence. For terseness, the term wtoken is used to mean word token. A State Change of -- means that the state does not change.

Start

Create a tab,tab word token. Move to the Standard state. All subsequent word tokens created are appended to this word token.

State : Standard

Event        : standalone
Action       : IF AL = true
               THEN Append character to standalone wtoken.
                    Prefix ← false.
               ELSE ( IF current wtoken not empty THEN Complete wtoken. )
                    Start standalone wtoken. Insert character in it.
                    AL ← true.
                    IF standalone = - AND previous event = whitespace
                    THEN Prefix ← true.
                    ELSE Prefix ← false.
State Change : --

Event        : numeric
Action       : IF previous event = - AND Prefix = true
               THEN Append character to standalone wtoken.
                    Prefix ← false. AL ← false.
               ELSE IF AL = true THEN Complete standalone wtoken. AL ← false.
                    Start numeric wtoken; insert character in it.
State Change : Number

For each of the next three events, the following is executed first :
IF AL = true THEN Complete standalone wtoken. AL ← false.

Event        : whitespace
Action       : --
State Change : --

Event        : text delimiter
Action       : Start text wtoken & insert the text delimiter into it.
State Change : Text

Event        : alphanumeric less ( standalone ∪ numeric )
Action       : Start keyword wtoken. Insert character into it.
State Change : Keyword

For each of the bracket events ( ) { } [ ] the following is executed first :
IF AL = true THEN Complete standalone wtoken. AL ← false.

Event        : (
Action       : Create (-wtoken. P ← P + 1
State Change : --

Event        : )
Action       : Create )-wtoken. P ← P - 1
State Change : --

Event        : {
Action       : Start literal wtoken. Insert { into it. C ← C + 1
State Change : Literal

Event        : }
Action       : Generate unbalanced } error.
State Change : --

Event        : [
Action       : Start parameter wtoken. Insert [ into it. S ← S + 1. RL ← 1
State Change : Enter?

Event        : ]
Action       : Generate unbalanced ] error.
State Change : --

Event        : No more input.
Action       : Create closing delimiter wtoken.
State Change : End

Note

The case of a standalone event where AL is not true and the current word token is empty (and therefore should not be completed) only arises if a RAQUEL statement begins with a standalone character. However, since the algorithm does not keep track of how far through the input statement it has got, it is easier to check whether the current word token is empty than to check whether the standalone character is at the beginning of the statement.

State : Text

Event        : text delimiter
Action       : Insert the text delimiter & complete text wtoken.
State Change : Standard

Event        : All other characters.
Action       : Append character to text wtoken.
State Change : --

Event        : No more input.
Action       : Complete text wtoken. Create closing delimiter wtoken. Generate incomplete text error.
State Change : End

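State : Number

For every event other than numeric, the following check is executed first, and the action listed for the event is executed next :
IF wtoken is not a valid number THEN Generate invalid number error.

Event        : numeric
Action       : Append character to number wtoken.
State Change : --

Event        : whitespace
Action       : Complete number wtoken.
State Change : Standard

Event        : standalone
Action       : Start standalone wtoken. Insert character into it. AL ← true
State Change : Standard

Event        : text delimiter
Action       : Start text wtoken. Insert the text delimiter into it.
State Change : Text

Event        : alphanumeric less ( standalone ∪ numeric )
Action       : Start keyword wtoken. Insert character into it.
State Change : Keyword

Event        : (
Action       : Create (-wtoken. P ← P + 1
State Change : Standard

Event        : )
Action       : Create )-wtoken. P ← P - 1
State Change : Standard

Event        : {
Action       : Start literal wtoken. Insert { into it. C ← C + 1
State Change : Literal

Event        : }
Action       : Generate unbalanced } error.
State Change : Standard

Event        : [
Action       : Start parameter wtoken. Insert [ into it. S ← S + 1. RL ← 1
State Change : Enter?

Event        : ]
Action       : Generate unbalanced ] error.
State Change : Standard

Event        : No more input.
Action       : Create closing delimiter wtoken.
State Change : End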

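State : Keyword

Event        : whitespace
Action       : Complete keyword wtoken.
State Change : Standard

Event        : standalone
Action       : Complete keyword wtoken. Start standalone wtoken. Insert character into it. AL ← true
State Change : Standard

Event        : alphanumeric less standalone
Action       : Append character to keyword wtoken.
State Change : --

Event        : (
Action       : Complete keyword wtoken. Create (-wtoken. P ← P + 1
State Change : Standard

Event        : )
Action       : Complete keyword wtoken. Create )-wtoken. P ← P - 1
State Change : Standard

Event        : {
Action       : Complete keyword wtoken. Start literal wtoken. Insert { into it. C ← C + 1
State Change : Literal

Event        : }
Action       : Complete keyword wtoken. Generate unbalanced } error.
State Change : Standard

Event        : [
Action       : Complete keyword wtoken. Start parameter wtoken. Insert [ into it. S ← S + 1. RL ← 1
State Change : Enter?

Event        : ]
Action       : Complete keyword wtoken. Generate unbalanced ] error.
State Change : Standard

Event        : No more input.
Action       : Complete keyword wtoken. Create closing delimiter wtoken.
State Change : End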

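State : Literal

Event        : whitespace
Action       : Append whitespace to literal wtoken.
State Change : --

Event        : alphanumeric
Action       : Append character to literal wtoken.
State Change : --

Event        : (
Action       : Append ( to literal wtoken.
State Change : --

Event        : )
Action       : Append ) to literal wtoken.
State Change : --

Event        : {
Action       : Append { to literal wtoken. C ← C + 1
State Change : --

Event        : }
Action       : Append } to literal wtoken. C ← C - 1. IF C = 0 THEN Complete literal wtoken.
State Change : IF C = 0 THEN Standard

Event        : [
Action       : Append [ to literal wtoken.
State Change : --

Event        : ]
Action       : Append ] to literal wtoken.
State Change : --

Event        : No more input.
Action       : Complete literal wtoken. Create closing delimiter wtoken. Generate unbalanced { error.
State Change : End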
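The Literal state reduces to counting curly brackets, as the following Python sketch suggests. The function name `scan_literal` is hypothetical, the input is assumed to start at the opening {, and an unbalanced { is reported by raising `ValueError` rather than by adding to the Success list.

```python
def scan_literal(text):
    """Return (literal word token, remaining text) for a literal at text[0]."""
    if not text.startswith("{"):
        raise ValueError("a literal must start with {")
    c = 0                            # the counter C: depth of { } nesting
    for i, ch in enumerate(text):
        if ch == "{":
            c += 1
        elif ch == "}":
            c -= 1
            if c == 0:
                # Nesting is closed: the literal word token is complete.
                return text[: i + 1], text[i + 1 :]
        # Every other character, including ( ) [ ] and whitespace,
        # is simply part of the literal.
    raise ValueError("unbalanced {")
```

For example, `scan_literal("{ Factory, 0 } SUPPLIER")` returns the literal `{ Factory, 0 }` as one word token together with the untouched remainder of the statement.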

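State : Enter?

Event        : whitespace
Action       : RL ← 0
State Change : Parameter

Event        : alphanumeric
Action       : RL ← 0. Append character to parameter wtoken.
State Change : Parameter

Event        : (
Action       : RL ← 0. Append ( to parameter wtoken.
State Change : Parameter

Event        : )
Action       : RL ← 0. Append ) to parameter wtoken.
State Change : Parameter

Event        : {
Action       : RL ← 0. Append { to parameter wtoken.
State Change : Parameter

Event        : }
Action       : RL ← 0. Append } to parameter wtoken.
State Change : Parameter

Event        : [
Action       : RL ← RL + 1
               IF RL = 2
               THEN Complete [-wtoken. Start parameter wtoken. Insert [ into it.
               ELSE Generate too many successive [ error.
State Change : --

Event        : ]
Action       : Append ] to parameter wtoken. RL ← 0. S ← S - 1. RR ← 1
State Change : Exit?

Event        : No more input.
Action       : Create closing delimiter wtoken. Generate incomplete parameter error.
State Change : End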

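State : Parameter

Event        : whitespace
Action       : Append whitespace to parameter wtoken.
State Change : --

Event        : alphanumeric
Action       : Append character to parameter wtoken.
State Change : --

Event        : (
Action       : Append ( to parameter wtoken.
State Change : --

Event        : )
Action       : Append ) to parameter wtoken.
State Change : --

Event        : {
Action       : Append { to parameter wtoken.
State Change : --

Event        : }
Action       : Append } to parameter wtoken.
State Change : --

Event        : [
Action       : Append [ to parameter wtoken. S ← S + 1. RL ← 1
State Change : Param[?

Event        : ]
Action       : Append ] to parameter wtoken. S ← S - 1. RR ← 1
State Change : Exit?

Event        : No more input.
Action       : Complete parameter wtoken. Create closing delimiter wtoken. Generate incomplete parameter error.
State Change : End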

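State : Param[?

Event        : whitespace
Action       : RL ← 0
State Change : Parameter

Event        : alphanumeric
Action       : RL ← 0. Append character to parameter wtoken.
State Change : Parameter

Event        : (
Action       : RL ← 0. Append ( to parameter wtoken.
State Change : Parameter

Event        : )
Action       : RL ← 0. Append ) to parameter wtoken.
State Change : Parameter

Event        : {
Action       : RL ← 0. Append { to parameter wtoken.
State Change : Parameter

Event        : }
Action       : RL ← 0. Append } to parameter wtoken.
State Change : Parameter

Event        : [
Action       : Append [ to parameter wtoken. RL ← RL + 1
               IF RL > 2 THEN Generate too many successive [ error.
State Change : --

Event        : ]
Action       : RL ← 0. S ← S - 1. RR ← 1
State Change : Exit?

Event        : No more input.
Action       : Create closing delimiter wtoken. Generate incomplete parameter error.
State Change : End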

State : Exit?

Event        : whitespace
Action       : RR ← 0
               IF S > 0
               THEN Append character to parameter wtoken.
               ELSE Complete parameter wtoken. Start wtoken without contents.
State Change : THEN Parameter ELSE Standard

Event        : alphanumeric
Action       : RR ← 0
               IF S > 0
               THEN Append character to parameter wtoken.
               ELSE Complete parameter wtoken. Start keyword wtoken. Insert character into it.
State Change : THEN Parameter ELSE Keyword

Event        : (
Action       : RR ← 0
               IF S > 0
               THEN Append ( to parameter wtoken.
               ELSE Complete parameter wtoken. Create (-wtoken.
State Change : THEN Parameter ELSE Standard

Event        : )
Action       : RR ← 0
               IF S > 0
               THEN Append ) to parameter wtoken.
               ELSE Complete parameter wtoken. Create )-wtoken.
State Change : THEN Parameter ELSE Standard

Event        : {
Action       : RR ← 0
               IF S > 0
               THEN Append { to parameter wtoken.
               ELSE Complete parameter wtoken. Start literal wtoken. Insert { into it. C ← C + 1
State Change : THEN Parameter ELSE Literal

Event        : }
Action       : RR ← 0
               IF S > 0
               THEN Append } to parameter wtoken.
               ELSE Complete parameter wtoken. Create unbalanced } error.
State Change : THEN Parameter ELSE Standard

Event        : [
Action       : RR ← 0. Generate missing keyword error. S ← S + 1

Event        : ]
Action       : RR ← RR + 1
               IF RR = 2 AND S = 0 THEN Complete parameter wtoken. Create ]-wtoken.
               IF RR > 2 THEN Generate too many successive ] error.
State Change : IF S = 0 THEN Standard

Event        : No more input.
Action       : Complete parameter wtoken. Create closing delimiter wtoken.
               IF S > 0 THEN Create incomplete parameter error.
State Change : End

State : End

In this state, the tokeniser terminates.

Notes

For simplicity, the above logical design always requires a closing delimiter word token at the end of the word token list. However this is logically unnecessary if an error is found in the statement, since the Tokeniser should return an empty word token list in that case.

In certain circumstances, it is not necessary to create the word token immediately before the closing delimiter word token when an error has been found at the end of the statement.