UNIVERSITY OF EDINBURGH
COLLEGE OF SCIENCE AND ENGINEERING
SCHOOL OF INFORMATICS

INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES

Saturday 10th December 2016
09:30 to 11:30

INSTRUCTIONS TO CANDIDATES

1. Answer all five questions in Part A, and two out of three questions in Part B. Each question in Part A is worth 10% of the total exam mark; each question in Part B is worth 25%.

2. Use a single script book for all questions.

3. Calculators may be used in this exam.

Convener: I. Simpson
External Examiner: I. Gent

THIS EXAMINATION WILL BE MARKED ANONYMOUSLY

Part A

1. (a) Draw the state diagram for a simple NFA N over the alphabet {a, ..., z} that accepts precisely the strings ending in lalla. The states should be labelled 0, 1, 2, ... in the obvious way. You may use obvious abbreviations wherever a large number of very similar transitions appears. [2 marks]

(b) Use the subset construction to convert N to an equivalent DFA M. You need only include the reachable states in your diagram. Label each state of M to show which set of states from N it corresponds to. Again, abbreviations are acceptable. [7 marks]

(c) How may we use M to identify all occurrences of the substring lalla within a longer string over {a, ..., z}? [1 mark]

2. (a) State the Pumping Lemma for regular languages. [3 marks]

(b) Consider the language L = {a^m b^n | m ≠ n} over the alphabet {a, b}. What difficulty do we encounter if we try to use the Pumping Lemma directly to show that L is not regular? It will suffice here to illustrate the failure of one plausible attempt at doing this, pinpointing where a problem arises. [3 marks]

(c) Give an indirect proof that the above language L is not regular, by using standard closure properties of regular languages and appealing to a standard example of a non-regular language.

3. Consider the following emission and transition probabilities for parts of speech. (For simplicity, we use MOD, a generic part of speech standing in for adjectives and adverbs.) Assume every sequence must be preceded by START and followed by STOP.

Emission probabilities (blank cells are 0):

          All    were   the    UNK
   DT                   1.0
   MOD    0.2                  0.8
   NN                          0.5
   VB            0.8           0.2

Transition probabilities (from row to column; blank cells are 0):

   from \ to   DT    MOD   NN    VB    STOP
   START       0.5   0.3   0.2
   DT                0.3   0.7
   MOD               0.5   0.3   0.2
   NN                            0.5   0.5
   VB          0.3   0.3   0.3         0.1

The UNK token in the emission table should be substituted for any input word that does not otherwise appear in the table. Now suppose that we receive this line from Lewis Carroll's Jabberwocky as input:

   All mimsy were the borogoves

Use the Viterbi algorithm with the tables above to determine the most probable tag sequence an HMM tagger would assign to this sentence. Show your work. [10 marks]
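For orientation on question 3: the Viterbi algorithm fills a chart with the best path probability for each (position, tag) pair and recovers the best tag sequence via backpointers. Below is a minimal sketch in Python; the dictionary encoding of the tables and all names are assumptions made here for illustration, not part of the exam.

# Minimal Viterbi sketch over the tables above (dictionary encoding and
# names are illustrative assumptions; blank table cells are probability 0).
emit = {
    "DT":  {"the": 1.0},
    "MOD": {"All": 0.2, "UNK": 0.8},
    "NN":  {"UNK": 0.5},
    "VB":  {"were": 0.8, "UNK": 0.2},
}
trans = {
    "START": {"DT": 0.5, "MOD": 0.3, "NN": 0.2},
    "DT":    {"MOD": 0.3, "NN": 0.7},
    "MOD":   {"MOD": 0.5, "NN": 0.3, "VB": 0.2},
    "NN":    {"VB": 0.5, "STOP": 0.5},
    "VB":    {"DT": 0.3, "MOD": 0.3, "NN": 0.3, "STOP": 0.1},
}

def viterbi(words):
    vocab = {w for row in emit.values() for w in row}
    words = [w if w in vocab else "UNK" for w in words]       # OOV words become UNK
    # chart[i][t] = (best prob of tagging words[:i+1] ending in tag t, previous tag)
    chart = [{t: (trans["START"].get(t, 0.0) * emit[t].get(words[0], 0.0), None)
              for t in emit}]
    for w in words[1:]:
        chart.append({t: max(((p * trans[s].get(t, 0.0) * emit[t].get(w, 0.0), s)
                              for s, (p, _) in chart[-1].items()),
                             key=lambda c: c[0])
                      for t in emit})
    prob, tag = max(((p * trans[t].get("STOP", 0.0), t)
                     for t, (p, _) in chart[-1].items()), key=lambda c: c[0])
    tags = [tag]
    for row in reversed(chart[1:]):                           # follow backpointers
        tags.append(row[tags[-1]][1])
    return list(reversed(tags)), prob

print(viterbi("All mimsy were the borogoves".split()))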

4. Consider the following grammar.

   S → NP VP          (1.0)
   VP → VB NP PP      (0.4)
      | VB NP         (0.6)
   PP → Prep NP       (1.0)
   NP → NP PP         (0.3)
      | NN            (0.7)
   NN → scientists    (0.3)
      | space         (0.5)
      | whales        (0.2)
   VB → count         (1.0)
   Prep → from        (1.0)

Now suppose we have the following input sentence, a February 2014 headline from BBC News:

   Scientists count whales from space

Draw all possible parse trees for this sentence and compute their probabilities. Which parse would the parser choose? (A sketch of how a PCFG scores a single tree appears after question 5.) [10 marks]

5. A polarity item is a word or phrase that can only appear in either a positive or a negative context. For example, somewhat can appear only in positive contexts, while at all can appear only in negative contexts:

   you like the film somewhat
   you do not like the film at all
   * you do not like the film somewhat
   * you like the film at all

(Recall that an example sentence preceded by * is one that is considered ungrammatical.)

Now suppose that we have the following grammar, in which nonterminals are capitalized, terminals are lowercase, and the start symbol is S:

   S → NP VP
   VP → VB ADVP
      | NEG VB ADVP
   NP → you
      | the film
   NEG → do not
   VB → like
   ADVP → somewhat
        | at all

This grammar accepts both the grammatical and the ungrammatical sentences above. You should redesign it so that it accepts the two grammatical sentences and rejects the ungrammatical ones.

(a) Design a new grammar without using agreement features.

(b) Design a new grammar using agreement features.

(c) Explain how the grammars with and without agreement features differ, and whether the difference is functionally important. (Hint: do the grammars generate the same string languages? What about their analyses?) [2 marks]
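As a worked illustration for question 4: a PCFG assigns a parse tree a probability equal to the product of the probabilities of all rules used in it. Below is a minimal sketch in Python; the tuple encoding of trees and all names are assumptions made here, and the tree shown is just one candidate parse (the one attaching the PP to the VP), not necessarily the winning one.

from math import prod

# A PCFG scores a tree as the product of its rule probabilities.
# Encoding (an assumption for this sketch): a tree is a nested tuple
# (rule_probability, child, ...) and a leaf is a terminal string.
def tree_prob(tree):
    if isinstance(tree, str):                    # terminal leaf
        return 1.0
    p, *children = tree
    return p * prod(tree_prob(c) for c in children)

# One candidate parse of "scientists count whales from space",
# with the PP "from space" attached to the VP (VP -> VB NP PP, 0.4):
vp_attach = (
    1.0,                                         # S -> NP VP
    (0.7, (0.3, "scientists")),                  # NP -> NN -> scientists
    (0.4,                                        # VP -> VB NP PP
     (1.0, "count"),                             # VB -> count
     (0.7, (0.2, "whales")),                     # NP -> NN -> whales
     (1.0, (1.0, "from"),                        # PP -> Prep NP, Prep -> from
      (0.7, (0.5, "space")))),                   # NP -> NN -> space
)
print(tree_prob(vp_attach))                      # 0.004116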

Part B

6. In typical Unix-style command shells, a Java program may be executed using the command java in conjunction with the name of the program to be run. The latter may be either the name of a class file or (if preceded by -jar) the name of a Java archive file containing the program. In addition, the command may specify a number of run-time options (these appear before the name of the program), as well as a list of arguments to be passed to the program itself (these appear after the name of the program).

A simplified version of the syntax of such commands is given by the following LL(1) grammar. The start symbol is command, and the six terminals are

   java   -   -jar   :   =   str

For the purpose of this question, we shall treat -jar as a single lexical item, although in other contexts - will be treated as a token by itself. The terminal str stands for a single lexical class of strings which we shall use for many different purposes: as file names, option names, option values and argument values.

The productions of the grammar are as follows:

   command → java opts file args
   opts → ε | opt opts
   opt → - str qualifier
   qualifier → ε | : str | = str
   file → str | -jar str
   args → ε | str args

(a) Write down the set E of potentially empty non-terminals for this grammar. Using this, calculate the First sets for all of the non-terminals in the grammar. [6 marks]

(b) Calculate the Follow sets for all the non-terminals in the grammar. (This is more demanding than part (a) and hence carries more credit.) [9 marks]

(c) Using these First and Follow sets, or else by simple inspection, construct the LL(1) parse table for the grammar. You may use either a double page or a single page in landscape orientation for this.

(d) Suppose now that we attempt to parse the sequence of tokens

   java str -jar str

via the standard LL(1) parsing algorithm, using your parse table from part (c).

Construct a table showing how the computation proceeds, displaying at each step the operation performed, the input remaining, and the state of the stack. Continue with the computation until either a successful parse is achieved or some error is encountered. If the latter occurs, indicate precisely where and how the error is detected by the parser. [6 marks]
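For reference, the standard table-driven LL(1) algorithm referred to in part (d) can be sketched as follows. This is a minimal illustration whose names and table representation are assumptions made here; the parse table itself is exactly what part (c) asks you to construct.

# Minimal table-driven LL(1) parser sketch. The table maps a
# (nonterminal, lookahead) pair to the right-hand side of a production;
# an epsilon-production is represented by the empty list [].
def ll1_parse(table, start, tokens):
    stack = ["$", start]                         # "$" marks the bottom of the stack
    remaining = tokens + ["$"]
    steps = []
    while stack:
        top, look = stack[-1], remaining[0]
        steps.append((top, list(remaining), list(stack)))
        if top == look:                          # match a terminal (or "$" vs "$")
            stack.pop()
            remaining.pop(0)
        elif (top, look) in table:               # expand a nonterminal
            stack.pop()
            stack.extend(reversed(table[(top, look)]))   # leftmost symbol on top
        else:
            raise SyntaxError(f"error at lookahead {look!r} with {top!r} on the stack")
    return steps                                 # the step-by-step trace part (d) asks for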

7. This question explores the notion of interleaving of regular languages, a concept not formally defined within the course. By an interleaving of two strings s and t, we mean any string u whose symbol occurrences may each be tagged with either 0 or 1 in such a way that the symbols tagged with 0 (in order) spell out precisely the string s, and those tagged with 1 spell out t. For example, each of the following is a possible interleaving of the strings moon and note:

   moonnote   notemoon   monotone   nomtoeon

By the interleaving of two languages L0 and L1, we mean the language consisting of all possible interleavings of a string from L0 and a string from L1:

   L0 ‖ L1 = { u | ∃s ∈ L0, t ∈ L1. u is an interleaving of s and t }

(a) Consider the following NFAs N0, N1 over the alphabet {a, b, c}:

   [State diagrams for N0 and N1 appear here; their transitions are labelled with a, b and c.]

Draw the state diagram for an NFA N such that L(N) = L(N0) ‖ L(N1), where ‖ is the interleaving operation defined above. (Hint: your NFA should have a state for every pair of states (q0, q1) where qi is a state of Ni for i = 0, 1. However, it should clearly not be the same as the NFA corresponding to the intersection L(N0) ∩ L(N1).)

(b) Now given two arbitrary NFAs N0 = (Q0, Δ0, S0, F0), N1 = (Q1, Δ1, S1, F1) over the same alphabet Σ, give a general mathematical definition of an NFA N = (Q, Δ, S, F) such that L(N) = L(N0) ‖ L(N1). Your definition should specify the set of states Q, the transition relation Δ ⊆ Q × Σ × Q, the set of start states S ⊆ Q, and the set of accepting states F ⊆ Q.

(c) Given an alphabet Σ, let L(Σ) denote the set of strings s such that every symbol a ∈ Σ occurs exactly once within s. Show how L(Σ) may be obtained from a family of extremely simple languages via interleaving. Hence show how one may construct an NFA N_Σ for L(Σ). How many states does this NFA have?

(d) State what it means for an NFA to be minimal (noting that the usual definition for DFAs makes sense also for NFAs). Is your NFA N_Σ from part (c) minimal? Briefly justify your answer.

(e) Given a string s, let M_s denote the obvious minimal DFA that accepts the string s and nothing else. By applying the construction from part (b) to two copies of M_s, we obtain a machine M_s^2 that accepts the language L_s^2 consisting of all interleavings of s with itself. Show that for any k ≥ 0, there is a string s such that M_s^2 has more than k times the number of states in the minimal DFA for L_s^2.
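The pairing construction hinted at in part (b) is easy to state concretely. Here is a minimal sketch in Python, where the NFA representation and all names are assumptions made for illustration (the exam itself expects a mathematical definition, not code).

from itertools import product

# Sketch of the interleaving product of two NFAs (cf. part (b)).
# An NFA is represented here as (states, delta, starts, finals), where
# delta is a set of (state, symbol, state) triples.
def interleave_nfa(n0, n1):
    (Q0, d0, S0, F0), (Q1, d1, S1, F1) = n0, n1
    Q = set(product(Q0, Q1))             # one state per pair (q0, q1)
    # On each symbol, either machine may move while the other stays put.
    delta = {((p, q), a, (p2, q)) for (p, a, p2) in d0 for q in Q1} | \
            {((p, q), a, (p, q2)) for (q, a, q2) in d1 for p in Q0}
    S = set(product(S0, S1))             # start in a pair of start states
    F = set(product(F0, F1))             # accept when both machines accept
    return Q, delta, S, F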

8. In English, the respectively construction relates two ordered sets of n words each, such that the ith word of the first set is related to the ith word of the second, for all i from 1 to n. For example, suppose we see the sentence:

   a wall, table, and floor are yellow, green, and blue, respectively

This means that a wall is yellow, a table is green, and a floor is blue.

Before we delve into the construction, let's first remind ourselves how semantics are assigned to easier sentences. Suppose that we have the following highly simplified grammar:

   S → a NN is JJ
   NN → wall | table | floor
   JJ → blue | green | yellow

(a) The above grammar generates simple sentences without the respectively construction, such as a table is green. We would like this sentence to have the semantics ∃x. Table(x) ∧ Green(x). Add lambda terms to the grammar so that it computes the correct semantics for each sentence that it generates. (Note: for this exercise, it is only important that your lambda terms compute to the correct semantics. It doesn't matter whether they are the same as those a semanticist would write.)

(b) Now we will develop a simple version of the respectively construction. Modify the grammar (including the lambda terms) so that it will accept sentences of the form "a NN and NN are JJ and JJ, respectively" and assign the correct semantics to them. For example, if your parser encounters the sentence a table and wall are blue and yellow, respectively, it should assign the correct semantics: ∃x. ∃y. Table(x) ∧ Blue(x) ∧ Wall(y) ∧ Yellow(y).

(c) Let's move to a more general form of the respectively construction. For this part, modify your syntactic grammar (don't worry about semantics yet) so that it will accept sentences of the form "a NN, ..., NN and NN are JJ, ..., JJ and JJ, respectively" for equal numbers of NN and JJ categories. If it's helpful, you can assume that the comma token (,) has category CC (coordinating conjunction). It is OK if your grammar accepts other sentences, for instance replacing comma (,) with and.

(d) As we've seen in lecture, all regular languages are expressible as context-free grammars, and indeed our original grammar and the modified grammar of (8b) produce regular languages. What about the language of (8c)? Is it regular? It suffices to give an intuitive justification; a formal proof is not needed.

(e) Add lambda terms to your grammar, or if you are unable to, explain informally what problem arises when you try to do so. [3 marks]

(f) Does the respectively construction remind you of any linguistic phenomena we discussed in class? How is it similar? How is it different? [3 marks]
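To make part (a) concrete: word meanings can be modelled as functions from an individual variable to a formula, with the sentence rule binding that variable under an existential. A minimal sketch follows; the string-based representation of formulas and all names here are assumptions made for illustration only, not the expected exam answer.

# Compositional semantics sketch for the grammar of part (a):
# each word denotes a function from a variable name to a formula string,
# and the rule S -> a NN is JJ conjoins the two under an existential.
NN = {
    "wall":  lambda x: f"Wall({x})",
    "table": lambda x: f"Table({x})",
    "floor": lambda x: f"Floor({x})",
}
JJ = {
    "blue":   lambda x: f"Blue({x})",
    "green":  lambda x: f"Green({x})",
    "yellow": lambda x: f"Yellow({x})",
}

def sentence_semantics(nn, jj):
    # "a <nn> is <jj>"  =>  exists x. NN(x) and JJ(x)
    return f"∃x. {NN[nn]('x')} ∧ {JJ[jj]('x')}"

print(sentence_semantics("table", "green"))   # ∃x. Table(x) ∧ Green(x)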