
Journal of Computer Engineering Vol. 1 No. 1 (June, 2011)
Copyright Mind Reader Publications, www.journalshub.com

Component Compilers

Joshua Urbain, Morteza Marzjarani
Computer Science and Information Systems Department
Saginaw Valley State University, 7400 Bay Road, University Center, MI 48710

Abstract

In the computer industry, there are techniques that must be learned to achieve a successful career. One of these is learning a computer language and compiling programs written in that language. A compiler takes a programming language and, through a strenuous process, converts programs in that language into a form the computer understands. Compilers are generally built to deal with a single language, using its predefined words and grammar. A compiler is complex, and it can take a large amount of work to make one complete and efficient. With compilation components such as Yacc and Lex, programmers can have a compiler front end that is generalized across languages, shedding roughly half of a compiler's code (a semantic analyzer, code generator, and code optimizer are still required) while yielding the same results as a fully hand-coded compiler.

Narrative

Yacc is a parser generator created for Unix systems. Yacc stands for Yet Another Compiler Compiler. It was developed at AT&T by Stephen C. Johnson (also known for developing the Portable C Compiler, as well as several other Unix tools). Yacc takes an input file that describes the language in Backus-Naur Form (BNF). BNF is widely used as a notation for the grammars of computer programming languages, instruction sets, and communications protocols, as well as for representing parts of natural-language grammars (for example, meter in Sanskrit poetry); most textbooks on programming language theory and semantics document their languages in BNF (Wikipedia). One interesting use of Yacc is processing a COBOL grammar: from that grammar, the output is used to build a way to convert COBOL programs into C. Figure 1 shows an input file that describes a grammar to Yacc in its BNF-like form, together with a minimal driver for the generated parser.

%{
#include <stdio.h>   /* For I/O */
%}
%start program
%token <intval> NUMBER      /* Simple integer */
%token <id>     IDENTIFIER  /* Simple identifier */
%%
program : numbers IDENTIFIER ;

numbers : /* empty */
        | numbers NUMBER
        ;
%%
main(int argc, char *argv[])
{
    yyparse();
}

yyerror(char *s)   /* Called by yyparse on error */
{
}

Figure 1  Yacc takes a grammar given in a BNF-like form and creates an output program which can interpret the language.
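A caveat for anyone typing this in: the <intval> and <id> tags on the %token lines presuppose a %union declaration defining those value fields, which the listing above omits. A minimal sketch, with member names of our own choosing, might be:

%union {
    int   intval;   /* semantic value of a NUMBER token */
    char *id;       /* text of an IDENTIFIER token */
}

With such a declaration in place (and matching scanner actions), yylval.intval and yylval.id carry each token's semantic value into the parser.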

At first glance, Yacc's input format is strikingly similar to that of Lex: lines of two percent signs mark the change from one section of the input file to the next, and the sections follow the same rules as Lex's. The top section (the prologue) holds macros and declarations, the middle section holds the grammar rules, and the bottom section may contain whatever code is needed to complete the file (the manual states that this section, the epilogue, can contain any code at all).

Yacc processes the grammar rules given in the input file and outputs a larger C program able to recognize the grammar it was given. The central C function made available to the programmer is yyparse, which reads tokens, performs the action attached to each rule, and reports whether the parse succeeded (internally the parser finishes through YYACCEPT on success and YYABORT on failure). Another powerful function is yyerror, through which errors, syntactic or otherwise, are reported. Yacc also provides error recovery: grammars can handle anticipated failure cases in place, reducing the need to restructure a grammar around them (a small sketch of this mechanism appears after Figure 2).

As Yacc became popular, many versions have been made since its release with Unix. Here are some of the many versions available. Berkeley Yacc is much like the original Yacc, except that it is written in ANSI C and placed in the public domain; its limitation is that it is Unix/Linux-based, which reduces the number of users who can take advantage of it. Another version is GNU Bison, which is almost completely cross-platform (there is even a 32-bit Windows build). Bison builds LALR parsers (look-ahead LR: the input is read left to right and a rightmost derivation is produced in reverse) and outputs C or C++ programs. Being open source and cross-platform, Bison is normally paired with Flex to form a powerful pair of components. ML-Yacc is a version of Yacc that outputs Standard ML. CL-Yacc uses LALR(1) parsing and outputs Common Lisp. Yecc is a version made to output Erlang. Happy generates output for Haskell and can even parse Haskell with itself. YAXX is a Yacc extension that sends Yacc output to XML for alternate methods of parsing (for example, web-based parsing, and for portability).

Yacc and its related versions take the grammar and create a program in their target language (a C program, in Yacc's case). Figure 2 shows a snippet of the output of Yacc's processing. For the example in Figure 1, the generated code is about 43% longer (in lines) than the input. The upside is that this scales astonishingly well: the 1,325 lines of grammar for COBOL are handled in about 1,900 lines of generated code.

/* A Bison parser, made from ex.y, by GNU bison 1.75. */

#define YYBISON 1  /* Identify Bison output. */

#define NUMBER     258
#define IDENTIFIER 259
...
/* YYNTOKENS -- Number of terminals. */
#define YYNTOKENS 5
/* YYNNTS -- Number of nonterminals. */
#define YYNNTS 3
/* YYNRULES -- Number of rules. */
#define YYNRULES 4
/* YYNSTATES -- Number of states. */
#define YYNSTATES 6
...
#line 16 "ex.y"
main(int argc, char *argv[])
{
}

Figure 2  Yacc takes the grammar given and creates an output program which can interpret the language.
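To make the error-recovery mechanism mentioned above concrete, here is a small sketch, ours rather than the paper's, using the reserved error token that Yacc and Bison provide. On a syntax error the parser pops states and discards tokens until this rule can apply, and yyerrok marks recovery as complete:

program : numbers IDENTIFIER
        | error IDENTIFIER      /* resynchronize on the trailing identifier */
          { yyerrok; }          /* declare the error handled */
        ;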

In Aho, Sethi, and Ullman's Compilers: Principles, Techniques, and Tools, there is a section dedicated to using Yacc as the syntax analyzer of a programming language. The book was originally published in 1986, when the Unix version of Yacc was in wide use. Figure 3 is a figure similar to the one used in the book, showing the process of using Yacc to build part of a compiler (assuming C as the output language).

Yacc specification (ex.y)  -->  Yacc compiler  -->  y.tab.c
y.tab.c                    -->  C compiler     -->  a.out
Input                      -->  a.out          -->  Output

Figure 3  Creation of a syntactic analyzer using Yacc (this diagram originally appeared in Aho, Sethi, and Ullman's Compilers: Principles, Techniques, and Tools).

In y.tab.c, Yacc creates an entire file dedicated to parsing the states of the language. The tool removes the pain of programming state diagrams by hand, automating that process entirely.

Yacc has been called a compiler tool, but we are willing to call it a compiler component, since parsing is covered completely by Yacc and its surrounding code. Compilers have many elements that make up their complex systems. Using Yacc alone reduces a portion of the work of making a compiler, but other components are available to make creating a compiler simpler still. Alongside the output from Yacc, we can move to the next component, Lex.

Lex is a lexer generator, or Lexical Analyzer Generator: given a set of patterns, it creates a program that recognizes those patterns in its input data. Lex was created by Eric Schmidt (well known as the CEO of Google Inc.) and Mike Lesk (known for his work on digital libraries and for funding the research project that became Google). Lex is part of the POSIX standard and is used on various Unix systems. Lex takes an input file containing its processing rules; from these rules it creates either a C file or Ratfor code (which is said to be simply converted to portable Fortran). An example of such an input file is shown below in Figure 4.

%{
/*
** Example lexical structure.
*/
#include <string.h>   /* for strdup */
#include "ex.tab.h"   /* The tokens */
%}

DIGIT [0-9]
ID    [a-z][a-z0-9]*

%%
{DIGIT}+  { return (NUMBER); }
{ID}      { return (IDENTIFIER); }
.         { return (yytext[0]); }
%%

int yywrap(void)
{
    return 1;   /* no further input files */
}

Figure 4  Lex reduces the difficulty of lexical analysis: a single input file dictates the rules to follow.
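As written, the actions in Figure 4 return token codes but never set semantic values. Assuming the hypothetical %union sketched earlier, the two main rules would typically also fill in yylval before returning, roughly like this (our variant, not the paper's code; atoi requires <stdlib.h> in the prologue):

{DIGIT}+  { yylval.intval = atoi(yytext);   /* numeric value of the lexeme */
            return (NUMBER); }
{ID}      { yylval.id = strdup(yytext);     /* copy: yytext's buffer is reused */
            return (IDENTIFIER); }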

The Lex manual states that a Lex rule file consists of three parts separated by lines of two percent signs: definitions, rules, and user subroutines, respectively (a small self-contained example appears after Figure 5). The definitions section tells Lex which tokens exist, names pattern macros, and imports C header files. The rules section states, using regular expressions and associated C code, what syntax makes up each structure, whether a statement or a declaration. The user subroutines section (sometimes called the C-code section) is used for post-processing, for cleaning up the output, or for extra functionality. The full regular-expression notation available in the rules is one of the reasons Lex is such a powerful analyzer.

Although Lex was originally proprietary, the AT&T-coded versions were released as open source over time. This also spurred the creation of Flex, a free, open-source alternative to Lex. Flex stands for Fast Lexical Analyzer Generator and was created around 1987 by Vern Paxson. Flex accepts the same input files as Lex and produces similar output. It is not a GNU project, though its manual was created by the GNU project. Most versions of Linux can download and run Flex, and it has even been ported to a 32-bit Windows executable by third parties. Work on Flex has continued, yielding improvements such as Flex++, which combines the power of Flex with C++ output for classes and other advanced operations, and JFlex, a fast scanner generator for Java.

Lex and its similar counterparts take the rules and create a program in their target language (a C or Ratfor program, in Lex's case). Figure 5 shows a snippet of the output of Lex's processing. For the example in Figure 4, the generated code is roughly 2,250% larger (in lines) than the input. The upside is that this scales quite well, handling the 442 lines of rules for COBOL in about 10,000 lines of generated code.

case YY_STATE_EOF(COPY_STATE):
#line 49 "cobol_lex.l"
{}
    YY_BREAK
case 7:
YY_RULE_SETUP
#line 51 "cobol_lex.l"
{ return (tok_integer); }
    YY_BREAK
case 8:
YY_RULE_SETUP
#line 53 "cobol_lex.l"
{ return (tok_float); }
    YY_BREAK

Figure 5  Lex takes the rules given and creates an output program which applies them.
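The self-contained example promised above: a minimal Flex specification of our own (not from the paper) that counts lines and words, with the definitions, rules, and user-subroutines sections separated by %% lines. Building it with flex count.l && cc lex.yy.c -o count should yield a working counter:

%{
#include <stdio.h>
/* Counters updated by the rules below. */
int lines = 0, words = 0;
%}
WORD [^ \t\n]+

%%
{WORD}  { words++; }    /* rules section: pattern / action pairs */
\n      { lines++; }
.       { }             /* ignore blanks and stray punctuation */
%%

/* User-subroutines section: the required yywrap and a driver. */
int yywrap(void) { return 1; }

int main(void)
{
    yylex();
    printf("%d lines, %d words\n", lines, words);
    return 0;
}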

Aho, Sethi, and Ullman's book also has a section dedicated to using Lex as the lexical analyzer of a programming language; at the time, the Unix version of Lex was in wide use. Figure 6 is a figure similar to the one used in the book, showing the process of using Lex to build part of a compiler (assuming C as Lex's output language).

Lex source program (lex.l)  -->  Lex compiler  -->  lex.yy.c
lex.yy.c                    -->  C compiler    -->  a.out
Input stream                -->  a.out         -->  Sequence of tokens

Figure 6  Creation of a lexical analyzer using Lex (this diagram originally appeared in Aho, Sethi, and Ullman's Compilers: Principles, Techniques, and Tools).

In lex.yy.c, Lex exposes two variables to the programmer: yytext, which holds the text of the current lexeme (so yytext[0] is its first character), and yyleng, which holds the lexeme's length. This control over incoming lexemes is what makes the generated program powerful.

Used together, these tools can produce the majority of a compiler's complex parts. Combined, they work so well that they seem to have been created by the same group (even though they were not). A good example can be put together quickly with tools readily available on Linux; a simple check on a campus Linux shell found three of the four tools lex/flex and yacc/bison installed. Next we decide what the syntax of our language will be. We assume here that the language consists of numbers, each made of one or more digits (zero through nine), followed by a single identifier made of one alphabetic character and then any number of alphanumeric characters. From this decision we create the syntax NUMBER+ IDENTIFIER, and more concretely the patterns [0-9]+ and [a-z][a-z0-9]*. The code shown above implements exactly this specification, and it is all we need to show the power of these tools. Figure 8 lists the commands needed to run flex and bison (whose use closely mirrors lex and yacc) and to compile their outputs.

Only by making mistakes in the example does one truly see the power of these tools. Reversing the order of the nonterminal and the token in the recursive rule, for instance, produced a shift/reduce warning in our test. Bison builds LALR(1) parsers, which read the input left to right with a single token of look-ahead, and the idiomatic way to express repetition in such a grammar is left recursion, which lets the parser reduce as it goes and keeps its stack bounded. With the rule in the left-recursive order, the warning disappears and bison compiles the grammar correctly. Figure 7 shows the rule in its correct and incorrect forms.

Joshua Urbain, Morteza Marzjarani numbers : /* empty */ numbers NUMBER ; numbers : /* empty */ NUMBER numbers ; Correct Incorrect, shift / reduce error Figure 7 The incorrect ordering of the syntax rule and token causes a shift error due to the nature of the parser. [test:~/code]$ bison -dv ex.y [test:~/code]$ gcc -c ex.tab.c [test:~/code]$ flex ex.lex [test:~/code]$ gcc -c lex.yy.c [test:~/code]$ gcc -o ex ex.tab.o lex.yy.o -lm [test:~/code]$ ex test Starting parse Entering state 0 Reducing via rule 2 (line 13), -> numbers state stack now 0 Entering state 2 Reading a token: 12221 Next token is token NUMBER () Shifting token 258 (NUMBER), Entering state 4 Reducing via rule 3 (line 14), numbers NUMBER -> numbers state stack now 0 Entering state 2 Reading a token: Next token is token IDENTIFIER () Shifting token 259 (IDENTIFIER), Entering state 5 Reducing via rule 1 (line 11), numbers IDENTIFIER -> program state stack now 0 Entering state 1 Reading a token: 12 Next token is token NUMBER () parse error Error: popping nterm program () Error: state stack now 0 Parse Completed with 1 errors. Figure 8 By using these commands, we already have a way to parse and test the syntax. In Anthony Aaby s Compiler Construction using Flex and Bison (where some of the code from above is based from), he takes this code to the next step by adding a symbol table, code generator, and a stack machine, as well as a mass of commands (our example above was a proof of concept to show these tools in their more primitive form). These parts work as the other components of a compiler to create an entire compiler. The symbol table is used to store the tokens of importance, while the code generator and the stack machine are used together to create appropriate assembly commands from the code used. From the output of this compiler there is code which can be used to achieve the input given, as a normal compiler functions. Looking through components available for compiler creation, the other parts are either trivial or rather hard to make generic enough to be available. A symbol table is an easily accessible data structure and does not need a component to build this object (although, one is sure doing an easy internet search would yield results for this). The symbol table could be created using a linked list or even an associative array (both of these structures are sometimes included in programming languages as predefined data structures for use). A code generator is a difficult part to

A code generator, by contrast, is a difficult part to generalize. When searching for code generators in the spirit of Yacc or Lex, none were found. One would assume this lack of generalized code generators is due to the diversity of languages, code generation being specific to each construct; the overhead of generalizing code generation would be quite high and would make optimization painful. Code optimization algorithms are easily accessible, but there is no program to automate the whole task, since it is another language-specific, compiler-specific job. Stephen C. Johnson, who developed Yacc, went on to develop Lint, the well-known static checker that aids in tightening and optimizing C programs.

By examining the parts and components used to create a compiler, we come to the realization that Yacc and Lex speed the work of building one. Even though a large sector of the work is done for us, fractions of code must still be implemented to get a compiler up and running; even so, this saves a great deal of work compared with having no such components at all. Figure 9 shows a basic diagram of the process of compiling a program; using these tools, almost every step except code generation and optimization is fulfilled. For the price, this approach appears to be the best possible way to get started with creating a compiler and getting it ready to use.

Figure 9  A compiler has many steps, each of which can be turned into a component to simplify the task (image from Wikipedia).

References

Aaby, Anthony A. Compiler Construction Using Flex and Bison. College Place: Walla Walla College, 2003. 1-102.

Aho, Alfred V., and Jeffrey D. Ullman. Principles of Compiler Design. 1st ed. Addison-Wesley, 1979. 1-604.

Aho, Alfred V., Jeffrey D. Ullman, and Ravi Sethi. Compilers: Principles, Techniques, and Tools. 1st ed. Pearson Education, 1986. 1-795.

Joshua Urbain, Morteza Marzjarani "Code generation (compiler)." Wikipedia, The Free Encyclopedia. 1 Jul 2007, 21:47 UTC. Wikimedia Foundation, Inc. 10 Aug 2007 <http://en.wikipedia.org/w/index.php?title=code_generation_%28compiler%29&oldid=141880234>. "Compiler." Wikipedia, The Free Encyclopedia. 2 Aug 2007, 16:14 UTC. Wikimedia Foundation, Inc. 10 Aug 2007 <http://en.wikipedia.org/w/index.php?title=compiler&oldid=148743199>. "Flex++." Wikipedia, The Free Encyclopedia. 18 Jul 2007, 04:13 UTC. Wikimedia Foundation, Inc. 10 Aug 2007 <http://en.wikipedia.org/w/index.php?title=flex%2b%2b&oldid=145370464>. "GNU bison." Wikipedia, The Free Encyclopedia. 4 Aug 2007, 11:58 UTC. Wikimedia Foundation, Inc. 10 Aug 2007 <http://en.wikipedia.org/w/index.php?title=gnu_bison&oldid=149128501>. Johnson, Stephen C. "Lex - a Lexical Analyzer Generator." The Lex & Yacc Page. AT&T Bell Laboratories. 10 Aug. 2007 <http://dinosaur.compilertools.net/lex/index.html>. Lesk, M E., and E Schmidt. "Lex - a Lexical Analyzer Generator." The Lex & Yacc Page. 10 Aug. 2007 <http://dinosaur.compilertools.net/lex/index.html>. "Lex programming tool." Wikipedia, The Free Encyclopedia. 27 Jul 2007, 23:02 UTC. Wikimedia Foundation, Inc. 10 Aug 2007 <http://en.wikipedia.org/w/index.php?title=lex_programming_tool&oldid=147554577>. "Lint (software)." Wikipedia, The Free Encyclopedia. 15 Jul 2007, 19:59 UTC. Wikimedia Foundation, Inc. 10 Aug 2007 <http://en.wikipedia.org/w/index.php?title=lint_%28software%29&oldid=144850041>. Niemann, Tom. A Compact Guide to Lex & Yacc. Epaperpress.Com. 1-40. "Optimization (computer science)." Wikipedia, The Free Encyclopedia. 8 Aug 2007, 18:37 UTC. Wikimedia Foundation, Inc. 10 Aug 2007 <http://en.wikipedia.org/w/index.php?title=optimization_%28computer_science%29&oldid=150022368>. "Stephen C. Johnson." Wikipedia, The Free Encyclopedia. 24 May 2007, 17:15 UTC. Wikimedia Foundation, Inc. 10 Aug 2007 <http://en.wikipedia.org/w/index.php?title=stephen_c._johnson&oldid=133210252>. "Yacc." Wikipedia, The Free Encyclopedia. 15 Jul 2007, 20:15 UTC. Wikimedia Foundation, Inc. 10 Aug 2007 <http://en.wikipedia.org/w/index.php?title=yacc&oldid=144852930>.