CS Introduction to Data Structures How to Parse Arithmetic Expressions

Similar documents
CISC-235. At this point we fnally turned our atention to a data structure: the stack

Flow Control. So Far: Writing simple statements that get executed one after another.

Stacks. stacks of dishes or trays in a cafeteria. Last In First Out discipline (LIFO)

09 STACK APPLICATION DATA STRUCTURES AND ALGORITHMS REVERSE POLISH NOTATION

CSCI 204 Introduction to Computer Science II. Lab 6: Stack ADT

Intro. Scheme Basics. scm> 5 5. scm>

infix expressions (review)

CSC148H Week 3. Sadia Sharmin. May 24, /20

Formal Languages and Automata Theory, SS Project (due Week 14)

STACKS. A stack is defined in terms of its behavior. The common operations associated with a stack are as follows:

Text Input and Conditionals

Announcements. Lab Friday, 1-2:30 and 3-4:30 in Boot your laptop and start Forte, if you brought your laptop

Topic Notes: Bits and Bytes and Numbers

Some Applications of Stack. Spring Semester 2007 Programming and Data Structure 1

Stack Applications. Lecture 27 Sections Robb T. Koether. Hampden-Sydney College. Wed, Mar 29, 2017

Operators. Java operators are classified into three categories:

Stacks. Chapter 5. Copyright 2012 by Pearson Education, Inc. All rights reserved

ESCI 386 IDL Programming for Advanced Earth Science Applications Lesson 1 IDL Operators

These are notes for the third lecture; if statements and loops.

19 Much that I bound, I could not free; Much that I freed returned to me. Lee Wilson Dodd

Project 3: RPN Calculator

Objective- Students will be able to use the Order of Operations to evaluate algebraic expressions. Evaluating Algebraic Expressions

Topic Notes: Bits and Bytes and Numbers

Intermediate Algebra. Gregg Waterman Oregon Institute of Technology

Summer 2017 Discussion 10: July 25, Introduction. 2 Primitives and Define

Programming Lecture 3

CS164: Midterm I. Fall 2003

CS251 Programming Languages Handout # 29 Prof. Lyn Turbak March 7, 2007 Wellesley College

Only to be used for arranged hours. Order of Operations

CS112 Lecture: Variables, Expressions, Computation, Constants, Numeric Input-Output

Section 0.3 The Order of Operations

2.2 Syntax Definition

Lecture 4 CSE July 1992

CS 330 Homework Comma-Separated Expression

CISC

Data Structure using C++ Lecture 04. Data Structures and algorithm analysis in C++ Chapter , 3.2, 3.2.1

Language Reference Manual simplicity

// The next 4 functions return true on success, false on failure

Lecture 4: Stack Applications CS2504/CS4092 Algorithms and Linear Data Structures. Parentheses and Mathematical Expressions

Section 1.1 Definitions and Properties

CS125 : Introduction to Computer Science. Lecture Notes #11 Procedural Composition and Abstraction. c 2005, 2004 Jason Zych

5 R1 The one green in the same place so either of these could be green.

SCHEME 8. 1 Introduction. 2 Primitives COMPUTER SCIENCE 61A. March 23, 2017

CS 211 Programming Practicum Spring 2018

ARG! Language Reference Manual

CS 211 Programming Practicum Spring 2017

Getting Started. 1 by Conner Irwin

Assessment of Programming Skills of First Year CS Students: Problem Set

An Introduction to Trees

Decisions, Decisions. Testing, testing C H A P T E R 7

CS1622. Semantic Analysis. The Compiler So Far. Lecture 15 Semantic Analysis. How to build symbol tables How to use them to find

CS W3134: Data Structures in Java

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

(Refer Slide Time 3:31)

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Language Basics. /* The NUMBER GAME - User tries to guess a number between 1 and 10 */ /* Generate a random number between 1 and 10 */

Objects and Types. COMS W1007 Introduction to Computer Science. Christopher Conway 29 May 2003

A Java program contains at least one class definition.

9/10/10. Arithmetic Operators. Today. Assigning floats to ints. Arithmetic Operators & Expressions. What do you think is the output?

Postfix (and prefix) notation

Introduction to Lexing and Parsing

1/30/2018. Conventions Provide Implicit Information. ECE 220: Computer Systems & Programming. Arithmetic with Trees is Unambiguous

An algorithm may be expressed in a number of ways:

CS 211 Programming Practicum Fall 2018

DEBUGGING TIPS. 1 Introduction COMPUTER SCIENCE 61A

Fundamentals. Fundamentals. Fundamentals. We build up instructions from three types of materials

5.5 Completing the Square for the Vertex

CS125 : Introduction to Computer Science. Lecture Notes #38 and #39 Quicksort. c 2005, 2003, 2002, 2000 Jason Zych

Ch. 12: Operator Overloading

DATA STRUCTURES AND ALGORITHMS

Table of Laplace Transforms

Project 5: Extensions to the MiniJava Compiler

CS664 Compiler Theory and Design LIU 1 of 16 ANTLR. Christopher League* 17 February Figure 1: ANTLR plugin installer

Have the students look at the editor on their computers. Refer to overhead projector as necessary.

Variables and Data Representation

(Provisional) Lecture 22: Rackette Overview, Binary Tree Analysis 10:00 AM, Oct 27, 2017

COMP 250 Winter stacks Feb. 2, 2016

CS52 - Assignment 8. Due Friday 4/15 at 5:00pm.

March 13/2003 Jayakanth Srinivasan,

Control Structures. Lecture 4 COP 3014 Fall September 18, 2017

CSCI 200 Lab 4 Evaluating infix arithmetic expressions

(Provisional) Lecture 08: List recursion and recursive diagrams 10:00 AM, Sep 22, 2017

Is the statement sufficient? If both x and y are odd, is xy odd? 1) xy 2 < 0. Odds & Evens. Positives & Negatives. Answer: Yes, xy is odd

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Bits and Bytes and Numbers

ECE220: Computer Systems and Programming Spring 2018 Honors Section due: Saturday 14 April at 11:59:59 p.m. Code Generation for an LC-3 Compiler

Special Section: Building Your Own Compiler

Grade 6 Math Circles November 6 & Relations, Functions, and Morphisms

Compilers and computer architecture From strings to ASTs (2): context free grammars

Programming, Data Structures and Algorithms Prof. Hema Murthy Department of Computer Science and Engineering Indian Institute of Technology, Madras

Lecture 7: Primitive Recursion is Turing Computable. Michael Beeson

Operators & Expressions

ADTS, GRAMMARS, PARSING, TREE TRAVERSALS

Divisibility Rules and Their Explanations

MA 1128: Lecture 02 1/22/2018

What Every Programmer Should Know About Floating-Point Arithmetic

CMPSCI 187 / Spring 2015 Postfix Expression Evaluator

flex is not a bad tool to use for doing modest text transformations and for programs that collect statistics on input.

Implementing Programming Languages

esay Solutions Ltd Using Formulae

GBIL: Generic Binary Instrumentation Language. Language Reference Manual. By: Andrew Calvano. COMS W4115 Fall 2015 CVN

Transcription:

CS3901 - Introduction to Data Structures How to Parse Arithmetic Expressions Lt Col Joel Young One of the common task required in implementing programming languages, calculators, simulation systems, and spreadsheets is parsing and evaluating arithmetic expressions stored in strings. Arithmetic expressions are composed of binary operators, unary operators, numbers, and parenthesized sub-expressions. Parsing a fully parenthesized expression is straightforward as each sub-expression is bracketed by open and close parentheses so order-of-operations does not need to be considered. Typical expressions (as typed by humans) rely on the standard order of operations to resolve ambiguity. For example, -2^3^4x6 would parenthesize out to (-((2^(3^4)) x 6)). A typical system would perform a right-to-left pass over the input expression inserting parenthesis for exponentiation (highest-precedence) and then a (or several) left-to-right pass(es) adding parenthesis for the other operators by their precedence. In this note we discuss a strategy for directly parsing the string from left to right handling exponentiation, addition, subtraction, multiplication, division, and negation. The result of a successful parse is an expression tree which could be evaluated or used to convert the expression to post-fix or pre-fix or fully parenthesized in-fix notation. To help us, we are going to assume that we have a tokenizer that removes whitespace and separates the operators, parentheses, and numbers in an expression into separate strings called tokens. Most languages have such a capability either built in or in a library. For example, C++ has strtok or boost s tokenizer library. An example tokenization for 6 x 6 - (--3) would be Tokens: [(] [6] [x] [6] [-] [(] [-] [-] [3] [)] [)] Notice that there is an extra set of parenthesis in the above. For simplification during the parse, we are going to force all expressions to have leading and closing parenthesis. You can think of this as similar to using sentinels in parsing a linked list. The first thing we ll need is a means of expressing the precedence of operations. We ll use a mapping of operators to integers (could be implemented with an associative memory structure or hard-coded in a function). Our base precedence table is: operator priority ˆ 5 x 2 / 2 + 1-1 Notice that x and / have the same precedence. This is because they are tied for precedence and simply ordered from left to right. likewise for - and +. In addition, we have some internal operators that are used for bookkeeping: operator priority exponent 4 unary 3 open 0 Their use will be explained during the discussion below. Using the above table, we can compare the priority of two operators: priority( ^ ) priority( x ) which would be false. This will be useful for determining order of operations. We will store our state in five data-stores. We will have two push-downs (stacks) and three booleans: operators: stack holding operators as listed in the tables above. It contains all of the operators that we haven t yet found operands for. The top operator will be the last one we ve received a token for. expressions: stack holding expressions that have been partially parsed. The top expression on the stack is the last one we have processed. next is unary boolean that is true when the next operator to the right must be a unary operator and false otherwise. This is used to determine if the token - is a unary or binary operator and also to recognize ill-formed expressions like 6 + x 7. 1

last was expression boolean that is true when the previous token was a number and false otherwise. This is used to identify ill-formed expressions like (6 2 + 3). last was parenthesis boolean that is true when the previous token was an open paren and false otherwise. This is used to identify ill-formed expressions like ()5+6. In the end we will find that only one of the booleans are really required, but the reasoning is easier with them. Let s return to our tokenized representation of the expression. There are four kinds of tokens which we will need to respond differently two: We can imagine that if we looped through our tokens from left-to-right we could use a big compound if - else statement to select which kind of token we are processing each time. We ll assume that our tokenizer loop gives us the current token in a variable token and the following token in a variable next token. In this context, let s consider our boolean last was expression. This will need to be set to true any time we finish handling a number token or any other expression and false at all the other times. Also, if we are handling a number token or an open parenthesis, this variable better not be true! So adding this to the above list we get: (a) (*) if last was expression is true, throw an error. (b) (*) last was expression is false (a) (*) last was expression is true (a) (*) last was expression is false (a) (*) if last was expression is true, throw an error. (b) (*) last was expression is true Next let s think about next is unary. Consider the close parenthesis ). Are there any circumstances when the next token could be a unary operator after parsing this token? How about after immediately after a number? In an expression like ( 6 + 6 -...) is the - ever going to be a unary operation? In both cases the answer is no. Unary operators occur only after open parenthesis and after operators. Binary operators occur in exactly the opposite circumstances. Let s annotate our branches with this idea: (b) (*) next is unary is true (a) (*) next is unary is false (b) last was expression is true (a) (*) if next is unary is true then we know that token is a unary operator, otherwise it is binary: (b) (*) next is unary is true (b) (*) next is unary is false Next let s handle our third boolean last was parenthesis. We don t want expressions like () to be valid. Any time we have an empty parenthesis pair, we want to catch it. The simplest way is to set our boolean to true when we ve handled an open parenthesis: 2

(d) (*) last was parenthesis is true (a) (*) if last was parenthesis is true, throw an error. (d) (*) last was parenthesis is false (d) (*) last was parenthesis is false (d) (*) last was parenthesis is false Now that we have some basic machinery in place, let s think about the simplest sort of expression possible: The number. Consider the expression? 6?.? are unknown operators. Until their relative priority is known, we can t decide whether the number belongs to the left operator or the right operator. For example + 6 x would result in the number belonging to the right side and x 6 + would result in the number belonging to the left side. Since at this point we haven t yet processed the next token, let s just hold off on deciding what to do with it. So, if we receive a number token, let s just wrap it in an expression and push it on our expression stack: (d) last was parenthesis is true (b) (*) push number in token onto expressions What is the next thing we can readily handle? What if our current token is an open parenthesis? What do we know? Well, at this point, almost all we know is that the next token better not be a close parenthesis (which we covered above) and our previous token shouldn t have been either a number or a close parenthesis. We also know one more thing: Everything that follows to the next close parenthesis (which had better be there) will be bundled into an expression. At this time, we don t know how far away the next close parens is nor do we know what kind of expression will follow (number, binary, unary, or compound), so let s just push a marker onto our operations stack to help us remember where the expression started when we do get to the close parenthesis: 3

(b) (*) push open onto operations (d) last was expression is false (e) last was parenthesis is true Let s consider now how we would handle unary expressions. We already know when we should expect to see one: It will be when next is unary is true and our token is an operator. What should we do with it? The first thing to remember is that a unary operator (at least the only one we have to deal with) has only one operand (aka child expression) which follows it to the right. If the operator is a - we will push a unary expression of the form (-?) onto our expressions stack. The question mark is a place holder that we will fill in down the road once we ve fully parsed the operand. We ll also push the operation onto our operations stack. Because unary - is exactly the same character as binary - we ll also push a unary flag onto the stack to emphasize that the operation is unary. Check the priority of unary in the priority table. It is higher than everything other than exponentiation. Note that if the operator is anything else, there is an error in the input and we can stop. Here are our changes: (b) push open onto operations (d) last was expression is false (e) last was parenthesis is true A. (*) if token is - then (*) push -? onto expressions (*) push - onto operations (*) push unary onto operations B. (*) otherwise throw an error 4

Now let s consider handling the binary operators. The interesting thing with binary operators is determining how the bind to their operands because of the order of operations. Exponentials are the real kicker. They have higher priority than any other operator, and furthermore they bind right-to-left, rather than left to right. This means that an expression like -2^3^4 needs to bind as (-(2^(3^4))) and 3-2^3^4x6 is (3-((2^(3^4))x6)). Look back at the priority table. This is why ˆ has higher priority than unary and also why we have the exponent operator with lower priority than ˆ. When we get the binary operator, we first build our new binary expression with the form? token? where like above, the question mark is a place holder for the real expression. Before we push this new expression onto the expressions stack, we ve got to check the priority of this operator against the one on the top of the operations stack with an expression like: priority(operations.top) priority(token). If the operator on the stack is higher priority, we need to handle that one first. Note that the operator could be unary or binary so we ll need to check if the top operator is unary to decide if we are going to assemble a complete unary expression or a binary expression. We ll delay the discussion for how to actually assemble the expressions for a moment and, for now, just delegate it to functions. Finally note that there may be several higher priority operators on the stack so we need to do the above check in a loop instead of in a single if. This would happen with an expression such as 2^3^4 5 when processing the token. (b) push open onto operations (d) last was expression is false (e) last was parenthesis is true A. if token is - then push -? onto expressions push - onto operations push unary onto operations B. otherwise throw an error A. (*) while priority(operations.top) priority(token) (*) if operations.top is unary then: (*) assemble unary expression (*) otherwise: (*) assemble binary expression B. (*) push? token? onto expressions C. (*) if token is ˆ then (*) push exponent onto operations D. (*) otherwise: (*) push token onto operations 5

Well, now that we ve gotten to the point where we ve got expressions on our expressions stack ready to be assembled, how should we do it? Let s talk about assemble unary expression first. There are several things we need. First is to pop a couple operations off of the operations stack. At this point there will be a unary operator on top followed by a - operator. Next we need to take our operand off the top of the expressions stack. Then we need to take our placeholder expression -? off the stack. Next, we need to replace the question mark with our operand. Finally, we need to push our completed placeholder expression back on the stack. There are a couple of things that can go wrong here. First, there might be less than two expressions left on the stack and secondly, the expression we popped with the place holder might not actually be a unary expression! In either case, it means that the user entered an expression with a missing operand. Can you figure out why? Here is the pseudo-code for our assemble unary expression routine: 1 pop o p e r a t i o n s twice 2 pop e x p r e s s i o n s i n t o operand 3 pop e x p r e s s i o n s i n t o p l a c e h o l d e r 4 r e p l a c e the q u e s t i o n mark in p l a c e h o l d e r with operand Note that the safety checks to identify missing operands are missing from the above. Next we need to do assemble binary expression. It is very similar to our code for unary expressions except it doesn t work in a loop since there isn t enough information to determine if we can safely assemble the next expression. Here is the pseudo code: 1 pop o p e r a t i o n s 2 pop e x p r e s s i o n s i n t o r i g h t operand 3 pop e x p r e s s i o n s i n t o p l a c e h o l d e r 4 pop e x p r e s s i o n s i n t o l e f t operand 5 r e p l a c e the r i g h t q u e s t i o n mark in p l a c e h o l d e r with the r i g h t operand 6 r e p l a c e the l e f t q u e s t i o n mark in p l a c e h o l d e r with the l e f t operand As before, the safety checks for missing operands are not specified. The only thing left that we need to do is finish handling our close parenthesis tokens. When we we reach a closed parenthesis, we know that we can assemble any pending operations on the stack until we reach out open operator that we pushed on the stack when we got an the matching open parens. We need to remember to pop our open operator too! What s going to happen if we have an expression with an extra )? When we reach it we ll spin back and digest all of the expressions that come before and consume our sentinel open that came from the open and close parenthesis tokens we wrapped the user s expression with. At this point, our operations stack will be empty. The only time that is good is when we are at the end, in other words if next token is end. Let s see the results: (b) push open onto operations (d) last was expression is false (e) last was parenthesis is true i. (*) while operations.top open (*) if operations.top is unary then: (*) assemble unary expression 6

(*) otherwise: (*) assemble binary expression ii. (*) pop operations iii. (*) if operations is empty and next token is not at the end (*) throw an unbalanced parenthesis error A. if token is - then push -? onto expressions push - onto operations push unary onto operations B. otherwise throw an error A. while priority(operations.top) priority(token) if operations.top is unary then: assemble unary expression otherwise: assemble binary expression B. push? token? onto expressions C. if token is ˆ then push exponent onto operations D. otherwise: push token onto operations Now finally we have handled all the cases! If somehow we get to the end of our tokens and we still have operations on the operations stack or not exactly one expression on the expressions stack, we know the user entered a bad expression say with extra open parenthesis. Assuming that a good string was entered, you now have the final expression sitting as the only item in the expressions stack. Now let s consider our state. We have three boolean variables which should give us eight different possible states. But there are only four different transitions which means that only a maximum of four of those states are reachable. If we map out the transitions, we quickly see that last was expression is always inverted from next is unary so we can replace all references to it with next is unary is false in our algorithm. Further more we can notice that last was parenthesis is false whenever next is unary is false. The difference is that last was parenthesis is false after receiving a operation token and next is unary is true in that case. Well when is last was parenthesis used? Only when handling a closed parenthesis. What would happen if we got a closed parenthesis after getting an operator? It means we have a missing operand which is the same thing we were catching before. This means we can replace last was parenthesis with next is unary. Now we are down to only one state variable! By the way, if your implementation is carefully coded, you shouldn t need the safety checks in your assemble routines. Remember, we re going to loop through our tokens from left-to-right. Each token will be handled according as either a operator, parenthesis (open or closed), or as a number. When we finish looping, the operations stack should be empty and the expressions stack should have exactly one item, otherwise, we had an unbalanced parenthesis. Here are the final algorithms: 7

handle token: (a) (*) if next is unary is false, throw an error. (b) push open onto operations (a) (*) if next is unary is true, throw an error. i. while operations.top open if operations.top is unary then: assemble unary expression otherwise: assemble binary expression ii. pop operations iii. if operations is empty and next token is not at the end throw an unbalanced parenthesis error A. if token is - then push -? onto expressions push - onto operations push unary onto operations B. otherwise throw an error A. while priority(operations.top) priority(token) if operations.top is unary then: assemble unary expression otherwise: assemble binary expression B. push? token? onto expressions C. if token is ˆ then push exponent onto operations D. otherwise: push token onto operations (a) (*) if next is unary is false, throw an error. assemble unary expression: 1 pop o p e r a t i o n s twice 2 pop e x p r e s s i o n s i n t o operand 3 pop e x p r e s s i o n s i n t o p l a c e h o l d e r 4 r e p l a c e the q u e s t i o n mark in p l a c e h o l d e r with operand assemble binary expression: 1 pop o p e r a t i o n s 2 pop e x p r e s s i o n s i n t o r i g h t operand 3 pop e x p r e s s i o n s i n t o p l a c e h o l d e r 4 pop e x p r e s s i o n s i n t o l e f t operand 5 r e p l a c e the r i g h t q u e s t i o n mark in p l a c e h o l d e r with the r i g h t operand 6 r e p l a c e the l e f t q u e s t i o n mark in p l a c e h o l d e r with the l e f t operand 8