Cooking flex with Perl

Alberto Manuel Simões (albie@alfarrabio.di.uminho.pt)

Abstract

There are many tools for generating parsers with Perl. Perl's flexible data structures make it easy to build generic trees. While it is easy to write a grammar and a lexical analyzer using modules like Parse::Yapp and Parse::Lex, this pair of tools is not as efficient as I would like. In this document I present a quick way to combine Parse::Yapp with the best lexical analyzer I know: flex.

1 Parser structure

Before showing my results, I should briefly describe the standard way a parser is built and how it works. Since you use Perl or some other programming language, I can assume you know, or at least have used, a parser. Whenever you write a program in a language and its compiler generates code or interprets it, a parser is involved. It takes the code you wrote, splits it into tokens (strings of characters with a special meaning in the language) and then tries to match this token sequence against a grammar.

The program which splits the input into pieces is called a lexical analyzer. It uses regular expressions to match each token, and the tokens are returned, one by one, to the calling program. Commonly, it returns not only the token type (a name) but also the string which corresponds to the matched token.

The other step, putting these pieces together and checking whether their sequence makes sense, is done by another program, called a syntactic analyzer. It uses a grammar (a set of rules) to derive a sentence and to check that all tokens are used and no more are needed. While checking this, it builds a tree in memory with the program structure. To generate code, or to interpret the program, it then traverses this tree.

Traditional tools for generating parsers in C are lex and yacc (or their variants, flex and bison). For Perl, I can say there are no traditional tools.
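
As a rough illustration of the lexical analysis step described in this section, here is a minimal pure-Perl sketch. The token names (NUMBER, IDENT, OP), the rule table and the tokenize function are all hypothetical and are not part of the recipe that follows; they only show how regular expressions can turn input into (type, text) pairs.

```perl
use strict;
use warnings;

# Hypothetical token table: each entry pairs a token name with the
# regular expression that recognises it. The first matching rule wins.
my @RULES = (
    [ NUMBER => qr{\d+}           ],
    [ IDENT  => qr{[A-Za-z_]\w*}  ],
    [ OP     => qr{[-+*/=()]}     ],
);

# Split $input into a list of [type, text] pairs, skipping whitespace.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    pos($input) = 0;
    TOKEN: while (pos($input) < length $input) {
        next TOKEN if $input =~ /\G\s+/gc;      # ignore blanks
        for my $rule (@RULES) {
            my ($type, $re) = @$rule;
            if ($input =~ /\G($re)/gc) {        # match at current offset
                push @tokens, [ $type, $1 ];
                next TOKEN;
            }
        }
        die "Unexpected character at offset " . pos($input) . "\n";
    }
    return @tokens;
}

print join(' ', map { "$_->[0]($_->[1])" } tokenize('total = 3 + 42')), "\n";
```

Run as is, this prints IDENT(total) OP(=) NUMBER(3) OP(+) NUMBER(42). A syntactic analyzer would then consume these pairs one at a time.
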
Indeed, people implement parsers in Perl in many different ways.

2 Why flex

Some time ago, I needed to make the parser for a dictionary programming language more efficient. The original parser was written with a patched Berkeley Yacc (byacc), which can generate Perl from a grammar specification just like common yacc. The only difference lies in the semantic actions, which are written in Perl. The lexical analyzer was written with simple Perl regular expressions. This combination works, but loading the complete dictionary took a lot of time.

The first approach was to convert the grammar to a full Perl parser. I could choose tools like Parse::RecDescent, Parse::YALALR, Parse::Yapp or others. I chose Parse::Yapp not for its efficiency but because its syntax and functionality are very similar to the old yacc. Half an hour later I had a new working version of the parser, but it shaved only one or two seconds off the parsing task. This small speed-up occurs because byacc was generating a Perl program with a C structure, whereas Parse::Yapp creates full Perl programs, taking advantage of Perl facilities.

Next, since I could not make the syntax analyzer quicker, I tried to change the lexical one. The first idea was to use Parse::Lex, but since it uses Perl regular expressions I decided to find another option. I really like Perl, but I started thinking about a C implementation of the parser. That would take a long time, but I had heard about the XSUB concept and started working on a flex specification to glue to Parse::Yapp.

This glue can be implemented using different techniques. For example, I could write the lexical analyzer to return integers, each one representing a different grammar terminal symbol. Another approach is to return a string with the name of the symbol. While this can be less efficient than the first one, it makes the grammar more legible. This is the reason why I chose it.

I implemented this flex analyzer and glued it to Parse::Yapp. It ran in roughly half the parsing time:

    byacc + perl re    yapp + flex
    27.625 sec         14.161 sec

flex is very efficient, but Perl is very flexible. Putting these two tools together makes parser generation simple, quick and efficient.

3 The Recipe

Because the dictionary programming language is very complex, I decided to use a simple arithmetic parser as the example. This recipe assumes you know the basics of parser generation.

3.1 Writing the Grammar

First, as with other languages, we need to write the grammar and the lexical analyzer. Which to write first depends on your way of working but, normally, we start with the grammar to find the terminal symbols. The Parse::Yapp [1] syntax is like that of yacc:

    %{
    # Perl initialization code
    %}
    # Yapp commands
    %%
    # The Grammar with perl semantic actions
    %%
    # Perl functions

[1] As the Parse::Yapp module does not come installed with Perl, you have to get it from CPAN.

Let's write a grammar for arithmetic expressions. Isn't that exciting? I hope this is a straightforward grammar, whose only objective is to show the Parse::Yapp syntax. Create a mygrammar.yp file and write:

    %token NUMBER NL

    %left '-' '+'
    %left '*' '/'

    %%
    command: exp NL { return $_[1] }
           ;

    exp: exp '+' exp { return $_[1]+$_[3] }
       | exp '-' exp { return $_[1]-$_[3] }
       | exp '*' exp { return $_[1]*$_[3] }
       | exp '/' exp { return $_[1]/$_[3] } #forget x/0
       | number      { return $_[1] }
       ;

    number: NUMBER { return $_[1] }
          ;
    %%

3.2 Writing the Lexical Analyser

Next, let's write the lexical analyser in C. I hope you remember this assembler language :-) Create the file mylex.l with:

    %{
    #include <string.h>   /* for strcpy */
    #define YY_DECL char* yylex(void)
    char buffer[15];      /* holds the name of the last token */
    %}

    %%
    [0-9]+  { return strcpy(buffer, "NUMBER"); }

    \n      { return strcpy(buffer, "NL"); }

    .       { return strcpy(buffer, yytext); }

    %%
    int perl_yywrap(void) {
        return 1;
    }
    char* perl_yylextext(void) {
        return perl_yytext;
    }

The block at the top defines the prototype for the lexical analyzer. As said before, I want it to return strings instead of the integers returned by default. These strings have to be stored somewhere. One option would be to allocate memory each time a token is found. In this example, I allocate a single buffer where I put the token name. Remember that a hardcoded 15-character buffer means the names of terminal symbols must be at most 14 characters long, leaving room for the terminating NUL.

The next section is a normal lexical analyzer, returning the name of the token or the character found.

The last pair of functions is used to glue the parser to the lexical analyzer. The first one is called by flex when it reaches the end of the input; returning 1 tells it there is nothing more to scan. The second one gives access to the text which matched the regular expression.

3.3 Writing the glue

At this point the two main tools are written. Only the glue is missing. The easiest way to write it is to create a module for our parser with the h2xs tool:

    h2xs -n myparser

Then, copy the flex source and the Parse::Yapp grammar to the module directory (myparser/mylex.l and myparser/mygrammar.yp).

Parse::Yapp expects a yylex function that returns a pair: the name of the token and the matched text. To accomplish this, add the following function to your module (myparser.pm), just before the line containing 1;.

    sub perl_lex {
        my $token = perl_yylex();
        if ($token) {
            return ($token, perl_yylextext());
        } else {
            return (undef, "");
        }
    }

If you remember, I wrote perl_yylextext myself, but perl_yylex is generated by flex. I will need an error handling function, too! It prints the token that was read and the tokens that were expected. Add it after the perl_lex function.

    sub perl_error {
        my $self = shift;
        warn "Error: found ", $self->YYCurtok,
             " and expecting one of ", join(" or ", $self->YYExpect);
    }

Now, edit your myparser.xs file. It contains the glue code which maps C functions to Perl ones. Leave the code that is already there alone. Just add an include after the existing ones:

    #include "mylex.h"

and add this glue at the end of the file:

    char*
    perl_yylex()
        OUTPUT:
            RETVAL

    char*
    perl_yylextext()
        OUTPUT:
            RETVAL

Note that spaces and newlines are significant. The first line of each function is the return type. It is followed by a line with the name of the function and, if there are arguments, one line for each argument (as in old-style C). Then there must be two lines telling Perl where to get the return value of the C function.

As you may have noticed, the file named in the include directive does not exist yet! Create the mylex.h file and insert these prototypes:

    char* perl_yylex(void);
    char* perl_yylextext(void);

We are done mapping C functions to Perl ones. Now we have to map data types. Integers are trivial and handled directly by Perl, but we have pointers to characters. For these, we have to create a file named typemap in the module root directory with the text:

    char* T_PV

This maps pointers to characters to the built-in T_PV Perl type.
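
Before the XS glue compiles, it can be handy to look at the token stream through a pure-Perl lexer that follows the same calling convention as perl_lex. The following is only a sketch and not part of the original recipe: make_perl_lex is a hypothetical helper, and its three rules simply mirror those in mylex.l.

```perl
use strict;
use warnings;

# make_perl_lex (hypothetical name) returns a closure with the same
# interface as perl_lex: each call yields (token name, matched text),
# and (undef, "") once the input is exhausted.
sub make_perl_lex {
    my ($input) = @_;
    my $pos = 0;                                  # next unread offset
    return sub {
        return (undef, "") if $pos >= length $input;
        my $rest = substr($input, $pos);
        my ($token, $text);
        if ($rest =~ /^([0-9]+)/) {
            ($token, $text) = ('NUMBER', $1);
        } elsif ($rest =~ /^(\n)/) {
            ($token, $text) = ('NL', $1);
        } else {
            # Like the "." rule: the character is its own token name.
            $text  = substr($rest, 0, 1);
            $token = $text;
        }
        $pos += length $text;
        return ($token, $text);
    };
}

# Dump the token stream for one input line.
my $lex = make_perl_lex("1+42\n");
while (1) {
    my ($token, $text) = $lex->();
    last unless defined $token;
    printf "%-6s %s\n", $token, ($text eq "\n" ? '\n' : $text);
}
```

In principle, such a closure could be passed as the yylex argument in place of \&myparser::perl_lex while the XS part is not yet built; that substitution is an assumption and was not tested as part of this recipe.
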

3.4 Putting it all together

All right, you've done it! The only problem is that the standard Makefile.PL does not know how to compile this stuff. I changed my Makefile.PL to look like this:

    use ExtUtils::MakeMaker;

    $YACC_COMMAND = "( yapp -o mygrammar.pm -m mygrammar mygrammar.yp)";

    # This is needed before WriteMakefile
    `$YACC_COMMAND`;

    WriteMakefile(
        'NAME'         => 'myparser',
        'VERSION_FROM' => 'myparser.pm',
        'LIBS'         => ['-lfl'],     # This is for flex
        'MYEXTLIB'     => 'lib.so',     # Our lexer
    );

    sub MY::postamble {
    "
    \$(MYEXTLIB): lex.perl_yy.c myparser.pm
    \t\$(CC) -c lex.perl_yy.c
    \t\$(AR) cr \$(MYEXTLIB) lex.perl_yy.o
    \tranlib \$(MYEXTLIB)

    mygrammar.pm: mygrammar.yp
    \t$YACC_COMMAND

    lex.perl_yy.c: mylex.l
    \tflex -Pperl_yy mylex.l
    ";
    }

3.5 Testing

Now, compile it the standard way:

    perl Makefile.PL
    make
    make test

The default test does not show whether this really works. Let us add some lines to the end of the test script (the test.pl file):

    use mygrammar;
    my $parser = mygrammar->new;
    my $result = $parser->YYParse(yylex   => \&myparser::perl_lex,
                                  yyerror => \&myparser::perl_error);
    print $result;

Now, run make test again, type 1+4*2+3, press Enter and then Ctrl-D. You did it!
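
To know what value to expect from this test, recall that the two %left lines in the grammar make * and / bind tighter than + and -, with all four operators left-associative. The sketch below is independent of Parse::Yapp and flex: it evaluates the same kind of expression with a small precedence-climbing routine, so the result can be compared by hand. The names evaluate, parse_expr, %PREC and %APPLY are illustrative and appear nowhere in the recipe.

```perl
use strict;
use warnings;

# Binding strength of each operator, mirroring the two %left lines of
# the grammar: '*' and '/' bind tighter than '+' and '-'.
my %PREC = ('+' => 1, '-' => 1, '*' => 2, '/' => 2);

my %APPLY = (
    '+' => sub { $_[0] + $_[1] },
    '-' => sub { $_[0] - $_[1] },
    '*' => sub { $_[0] * $_[1] },
    '/' => sub { $_[0] / $_[1] },   # like the grammar, no check for x/0
);

# Evaluate a string such as "1+4*2+3" (digits and the four operators only).
sub evaluate {
    my ($expr) = @_;
    my @tokens = $expr =~ m{(\d+|[-+*/])}g;
    my $value  = parse_expr(\@tokens, 1);
    die "Unexpected token '$tokens[0]'\n" if @tokens;
    return $value;
}

# Precedence climbing: consume operators whose precedence is >= $min_prec.
sub parse_expr {
    my ($tokens, $min_prec) = @_;
    my $left = shift @$tokens;
    die "Expected a number\n" unless defined $left && $left =~ /^\d+$/;
    while (@$tokens
           && exists $PREC{ $tokens->[0] }
           && $PREC{ $tokens->[0] } >= $min_prec) {
        my $op = shift @$tokens;
        # Left associativity: the right operand may only contain
        # operators that bind strictly tighter than $op.
        my $right = parse_expr($tokens, $PREC{$op} + 1);
        $left = $APPLY{$op}->($left, $right);
    }
    return $left;
}

print evaluate('1+4*2+3'), "\n";
```

This prints 12, since 4*2 is computed first and the additions are then applied from left to right. With the precedence declarations given in mygrammar.yp, the parser built by the recipe would be expected to print the same value for that input.
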

4 Conclusions

While Perl is really flexible, there are tools which do some tasks more quickly than Perl. This is because Perl is an interpreted language, while tools like flex generate an optimized finite automaton written in C. This means we should not reinvent the wheel, but use these tools from Perl.

4.1 See also

Check these man pages: flex(1), perlguts(1), perlxs(1), perlxstut(1) and Parse::Yapp(3).