TDDE09, 729A27 Natural Language Processing (2017)
Exam 2017-03-13
Marco Kuhlmann

This exam consists of three parts:

1. Part A consists of 5 items, each worth 3 points. These items test your understanding of the basic algorithms that are covered in the course. They require only compact answers, such as a short text, calculation, or diagram. Collected wildcards are valid for this part of the exam. The numbering of the questions corresponds to the numbering of the wildcards.

2. Part B consists of 3 items, each worth 6 points. These items test your understanding of the more advanced algorithms that are covered in the course. They require detailed and coherent answers with correct terminology.

3. Part C consists of 1 item worth 12 points. This item tests your understanding of algorithms that have not been explicitly covered in the course. This item requires a detailed and coherent answer with correct terminology.

Grade requirements TDDE09: For grade 3, you need at least 12 points in Part A. For grade 4, you additionally need at least 12 points in Part B. For grade 5, you additionally need at least 6 points in Part C.

Grade requirements 729A27: For grade G (Pass), you need at least 12 points in Part A. For grade VG (Pass with distinction), you additionally need at least 12 points in Part B, and at least 6 points in Part C.

Note that surplus points in one part do not raise your score in another part. Good luck!

Part A

01 Text classification (3 points)

A Naive Bayes classifier has to decide whether the document "London Paris" is news about the United Kingdom (class U) or news about Spain (class S).

a) Estimate the probabilities that are relevant for this decision from the following document collection using Maximum Likelihood estimation (without smoothing). Answer with fractions.

      document           class
   1  London Paris       U
   2  Madrid London      S
   3  London Madrid      U
   4  Madrid Paris       S

b) Based on the estimated probabilities, which class does the classifier predict? Explain. Show that you have understood the Naive Bayes classification rule.

c) Practical implementations of a Naive Bayes classifier often use log probabilities. Explain why.

Sample answers:

a) Estimated probabilities:

   P(U) = 2/4    P(London | U) = 2/4    P(Paris | U) = 1/4
   P(S) = 2/4    P(London | S) = 1/4    P(Paris | S) = 1/4

b) The system first computes class-specific scores:

   score(U) = P(U) · P(London | U) · P(Paris | U) = 2/4 · 2/4 · 1/4 = 4/64
   score(S) = P(S) · P(London | S) · P(Paris | S) = 2/4 · 1/4 · 1/4 = 2/64

   The system then predicts the class with the highest score, here: U.

c) The Naive Bayes classification rule requires us to multiply probabilities, which are numbers between 0 and 1. If a document contains many words, this can lead to underflow. Converting probabilities to log probabilities avoids this problem.
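
For readers who want to check the arithmetic, here is a minimal Python sketch of the Naive Bayes decision for this example, using log probabilities as suggested in item (c). The probability values are the ones estimated above; the variable and function names are illustrative and not part of the exam.

   import math

   # MLE estimates from the four training documents (see sample answer a).
   priors = {"U": 2/4, "S": 2/4}
   likelihoods = {
       "U": {"London": 2/4, "Paris": 1/4, "Madrid": 1/4},
       "S": {"London": 1/4, "Paris": 1/4, "Madrid": 2/4},
   }

   def nb_log_score(document, cls):
       """Log of the class prior times the word likelihoods for one class."""
       score = math.log(priors[cls])
       for word in document:
           score += math.log(likelihoods[cls][word])
       return score

   document = ["London", "Paris"]
   scores = {cls: nb_log_score(document, cls) for cls in priors}
   print(scores)                       # U: log(4/64), S: log(2/64)
   print(max(scores, key=scores.get))  # predicted class: U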

02 Language modelling (3 points)

For the 520 million word Corpus of Contemporary American English, we have the following counts of unigrams and bigrams: your, 883,614; rights, 80,891; doorposts, 21; your rights, 378; your doorposts, 0.

a) Estimate the probabilities P(rights) and P(rights | your) using Maximum Likelihood estimation (no smoothing). Answer with fractions.

b) Estimate the bigram probability P(doorposts | your) using Maximum Likelihood estimation and add-k smoothing with k = 0.05. Assume that the vocabulary consists of 1,254,193 unique words. Answer with a fraction.

c) Suppose that we train three n-gram models on 38 million words of newspaper text: a unigram model, a bigram model, and a trigram model. Suppose further that we evaluate the trained models on 1.5 million words of text with the same vocabulary and obtain the following perplexity scores: 170, 109, and 962. Which perplexity belongs to which model? Provide a short explanation.

Sample answers:

a) Maximum Likelihood estimation:

   P(rights) = 80,891 / 520,000,000
   P(rights | your) = 378 / 883,614

b) Estimation with add-k smoothing, k = 0.05:

   P(doorposts | your) = (0 + 0.05) / (883,614 + 0.05 · 1,254,193)

c) Under reasonable assumptions, higher-order models yield lower perplexity scores (or entropy scores). Thus the proper assignment is: 962, unigram; 170, bigram; 109, trigram.
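
As a cross-check, the following Python snippet computes the estimates from sample answers (a) and (b). The counts and the add-k formula are taken from the item; the variable names are illustrative.

   # Counts from the exam item.
   corpus_size = 520_000_000
   count = {"your": 883_614, "rights": 80_891, "doorposts": 21,
            ("your", "rights"): 378, ("your", "doorposts"): 0}
   vocab_size = 1_254_193
   k = 0.05

   # a) Maximum Likelihood estimates.
   p_rights = count["rights"] / corpus_size
   p_rights_given_your = count[("your", "rights")] / count["your"]

   # b) Add-k smoothed bigram estimate: P(w | v) = (C(v, w) + k) / (C(v) + k * |V|)
   p_doorposts_given_your = (count[("your", "doorposts")] + k) / (count["your"] + k * vocab_size)

   print(p_rights, p_rights_given_your, p_doorposts_given_your)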

03 Part-of-speech tagging (3 points)

The following matrices specify a Hidden Markov model in terms of costs (negative log probabilities). The marked cell gives the transition cost from BOS to PL.

Transition costs:

          PL   PN   PP   VB   EOS
   BOS   [11]   2    3    4    19
   PL     17    3    2    5     7
   PN      5    4    3    1     8
   PP     12    4    6    7     9
   VB      3    2    3    3     7

Emission costs:

          hen  vilar  ut
   PL      17     17   4
   PN       3     19  19
   PP      19     19   3
   VB      19      8  19

When using the Viterbi algorithm to calculate the least expensive (most probable) tag sequence for the sentence "hen vilar ut" according to this model, one gets the following matrix. Note that the matrix is missing three values (marked cells A, B, C).

              hen  vilar  ut
   BOS   0
   PL           A     27  21
   PN           5      B  35
   PP          22     27  20
   VB          23     14  36
   EOS                         C

a) Calculate the value for the cell A. Explain your calculation.

b) Calculate the values for the cells B and C. Explain your calculations.

c) Starting in cell C, draw the backpointers that identify the most probable tag sequence for the sentence. State that tag sequence.

Sample answers:

a) A = 11 + 17 = 28

b) B = min(28 + 3 + 19, 5 + 4 + 19, 22 + 4 + 19, 23 + 2 + 19) = 28
   C = min(21 + 7, 35 + 8, 20 + 9, 36 + 7) = 28

c) The most probable tag sequence is BOS, PN, VB, PL, EOS. This sequence can be obtained as follows: Start in cell C. In each step, go to the cell in the previous column which gave the least cost in the corresponding minimisation problem.
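
The matrix values and the backpointer path can be reproduced with a small Viterbi implementation over costs (negative log probabilities). The sketch below hard-codes the transition and emission tables from item 03; it is an illustration written for this document, not part of the original exam.

   # Cost matrices from item 03 (costs = negative log probabilities).
   TRANS = {  # TRANS[previous tag][next tag]
       "BOS": {"PL": 11, "PN": 2, "PP": 3, "VB": 4, "EOS": 19},
       "PL":  {"PL": 17, "PN": 3, "PP": 2, "VB": 5, "EOS": 7},
       "PN":  {"PL": 5,  "PN": 4, "PP": 3, "VB": 1, "EOS": 8},
       "PP":  {"PL": 12, "PN": 4, "PP": 6, "VB": 7, "EOS": 9},
       "VB":  {"PL": 3,  "PN": 2, "PP": 3, "VB": 3, "EOS": 7},
   }
   EMIT = {  # EMIT[tag][word]
       "PL": {"hen": 17, "vilar": 17, "ut": 4},
       "PN": {"hen": 3,  "vilar": 19, "ut": 19},
       "PP": {"hen": 19, "vilar": 19, "ut": 3},
       "VB": {"hen": 19, "vilar": 8,  "ut": 19},
   }
   TAGS = ["PL", "PN", "PP", "VB"]

   def viterbi(words):
       """Return the least-cost tag sequence for the given words."""
       cost = {"BOS": 0}        # best cost of reaching each tag after the words seen so far
       backpointers = []
       for word in words:
           new_cost, pointers = {}, {}
           for tag in TAGS:
               best_prev = min(cost, key=lambda p: cost[p] + TRANS[p][tag])
               new_cost[tag] = cost[best_prev] + TRANS[best_prev][tag] + EMIT[tag][word]
               pointers[tag] = best_prev
           cost = new_cost
           backpointers.append(pointers)
       # Final transition into EOS, then follow the backpointers.
       last = min(cost, key=lambda p: cost[p] + TRANS[p]["EOS"])
       tags = [last]
       for pointers in reversed(backpointers[1:]):
           tags.append(pointers[tags[-1]])
       return list(reversed(tags))

   print(viterbi(["hen", "vilar", "ut"]))   # expected: ['PN', 'VB', 'PL']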

04 Syntactic analysis (3 points)

A transition-based dependency parser analyses the sentence "I booked a flight from L.A." Here is the gold-standard tree for this sentence:

   [Figure: gold-standard dependency tree over the words "I booked a flight from L.A."]

a) Suppose that the parser starts in the initial configuration for the sentence and takes the transitions SH, SH, LA. State the new configuration. To represent the partial dependency tree, list the arcs contained in it.

b) State a complete sequence of transitions that takes the parser all the way from the initial configuration to a terminal configuration, and that recreates all arcs of the gold-standard tree.

c) Provide a modified transition sequence where the parser mistakenly predicts the arc booked → from, but gets the other dependencies right.

Sample answers:

a) Configuration after SH, SH, LA:

   stack: [booked]
   buffer: [a, flight, from, L.A.]

   Partial dependency tree: booked → I

b) SH SH LA SH SH LA SH SH RA RA RA

c) SH SH LA SH SH LA RA SH SH RA RA
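
The transition sequences above are consistent with an arc-standard system in which SH shifts the next buffer word onto the stack, LA adds an arc from the topmost to the second-topmost stack word, and RA adds an arc in the opposite direction. The Python sketch below implements that reading; it is an illustration inferred from the sample answers, not the course's reference implementation.

   def shift(stack, buffer, arcs):
       """SH: move the next word from the buffer onto the stack."""
       return stack + [buffer[0]], buffer[1:], arcs

   def left_arc(stack, buffer, arcs):
       """LA: add arc top -> second and remove the second-topmost word."""
       second, top = stack[-2], stack[-1]
       return stack[:-2] + [top], buffer, arcs + [(top, second)]

   def right_arc(stack, buffer, arcs):
       """RA: add arc second -> top and remove the topmost word."""
       second, top = stack[-2], stack[-1]
       return stack[:-1], buffer, arcs + [(second, top)]

   TRANSITIONS = {"SH": shift, "LA": left_arc, "RA": right_arc}

   def run(words, sequence):
       config = ([], list(words), [])            # stack, buffer, arcs
       for name in sequence:
           config = TRANSITIONS[name](*config)
       return config

   words = ["I", "booked", "a", "flight", "from", "L.A."]
   stack, buffer, arcs = run(words, ["SH", "SH", "LA"])
   print(stack, buffer, arcs)
   # ['booked'] ['a', 'flight', 'from', 'L.A.'] [('booked', 'I')]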

05 Semantic analysis (3 points)

The Lesk algorithm is a simple method for word sense disambiguation that relies on the use of dictionaries. Here are glosses and examples from Wiktionary for three different senses of the word course:

1. A normal or customary sequence.
2. A learning program, as in university. "I need to take a French course."
3. The direction of movement of a vessel at any given moment. "The ship changed its course 15 degrees towards south."

a) Which of the three senses does the word course have in the following sentence?

   In the United States, the normal length of a course is one academic term.

b) Which of the three senses does the Lesk algorithm predict based on the given glosses and examples? Ignore the word course, punctuation, and stop words. Explain your answer.

c) Change the sentence such that the word course maintains its original sense, but the Lesk algorithm now predicts a different sense than in the previous item.

Sample answers:

a) Sense 2.

b) Sense 1. The algorithm chooses the sense whose lexicon entry (gloss plus example) has the highest number of overlapping words with the sentence. In this case there is 1 overlapping word with the entry for sense 1 (normal), but no overlap with the other entries.

c) Example: "In the United States, the typical length of a university course is one academic term." (One overlap with sense 2 [university]; no overlap with sense 1 [normal was changed to typical] or sense 3.)
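
A simplified Lesk implementation that counts content-word overlap between the sentence and each sense entry could look roughly like the sketch below. The sense entries are the Wiktionary glosses and examples from the item; the stop-word list is deliberately tiny and purely illustrative.

   SENSES = {
       1: "A normal or customary sequence.",
       2: "A learning program, as in university. I need to take a French course.",
       3: "The direction of movement of a vessel at any given moment. "
          "The ship changed its course 15 degrees towards south.",
   }

   # Minimal illustrative stop-word list; a real system would use a larger one.
   STOP_WORDS = {"a", "an", "the", "of", "in", "is", "as", "at", "any", "its",
                 "to", "i", "or", "course"}

   def content_words(text):
       """Lower-case, strip punctuation, and drop stop words."""
       tokens = text.lower().replace(",", " ").replace(".", " ").split()
       return {t for t in tokens if t not in STOP_WORDS}

   def simplified_lesk(sentence):
       """Return the sense whose entry shares the most content words with the sentence."""
       context = content_words(sentence)
       overlaps = {sense: len(context & content_words(entry))
                   for sense, entry in SENSES.items()}
       return max(overlaps, key=overlaps.get), overlaps

   sentence = "In the United States, the normal length of a course is one academic term."
   print(simplified_lesk(sentence))   # sense 1 wins via the single overlap 'normal'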

Part B

06 Levenshtein distance (6 points)

The following matrix shows the values computed by the Wagner-Fischer algorithm for finding the Levenshtein distance between the two words intention and execution. Note that the matrix is missing some values (marked cells).

   n   A   8   8   8   8   8   8   7   6   5
   o   A   7   7   7   7   7   7   6   5   6
   i   A   6   6   6   6   6   6   5   6   7
   t   A   5   5   5   5   5   5   6   7   8
   n   A   4   4   4   4   5   6   7   7   7
   e   A   3   4   B   4   5   6   6   7   8
   t   A   3   3   3   4   5   6   6   7   8
   n   A   2   2   3   4   5   6   7   7   7
   i   A   1   2   3   4   5   6   6   7   8
   #   A   A   A   A   A   A   A   A   A   A
       #   e   x   e   c   u   t   i   o   n

a) Define the concept of the Levenshtein distance between two words. The definition should be understandable even to readers who have not taken this course.

b) Provide the values for the cells marked A. Explain.

c) Calculate the value for the cell marked B. Explain. Show that you have understood the Wagner-Fischer algorithm.

Grading: Subtract points if the answer to any of the following is "no":

a) Are the operations clearly defined? Is it clear that there is a difference between operations and costs? Is it clear that the Levenshtein distance is the minimum over the costs for all sequences?

b) Is the chart correct? Is the explanation correct and complete in that it refers to the operations of insertion and substitution?

c) Is the answer correct? Is the explanation complete in that it refers to how the Wagner-Fischer algorithm works? Is it clear that in computing B and C we need to pick the smallest value from a list of candidates?
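
A direct implementation of the dynamic program behind this matrix, assuming unit costs for insertion, deletion, and substitution, might look like the sketch below. It is an illustration for this document, not the exam's reference solution.

   def levenshtein_matrix(source, target):
       """Fill the Wagner-Fischer table D, where D[i][j] is the edit distance
       between the first i characters of `source` and the first j of `target`."""
       m, n = len(source), len(target)
       D = [[0] * (n + 1) for _ in range(m + 1)]
       for i in range(1, m + 1):
           D[i][0] = i                      # delete all of source[:i]
       for j in range(1, n + 1):
           D[0][j] = j                      # insert all of target[:j]
       for i in range(1, m + 1):
           for j in range(1, n + 1):
               substitution = 0 if source[i - 1] == target[j - 1] else 1
               D[i][j] = min(D[i - 1][j] + 1,              # deletion
                             D[i][j - 1] + 1,              # insertion
                             D[i - 1][j - 1] + substitution)
       return D

   D = levenshtein_matrix("intention", "execution")
   print(D[-1][-1])   # Levenshtein distance; 5 under unit costs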

07 Eisner algorithm (6 points)

Here is incomplete pseudocode for the Eisner algorithm. Please note that in this code, spans of words are indexed by the leftmost word and the rightmost word in the span, and words are indexed by their positions from 1 to n, the total number of words in the sentence. (This is the same indexing scheme as in the extra material, but different from the scheme in the lecture.)

   for i in [1, ..., n]:
       T1[i][i] ← 0
       T2[i][i] ← 0
   for k in [2, ..., n]:
       for i in [k − 1, ..., 1]:
           T3[i][k] = max_{i ≤ j < k} (T2[i][j] + T1[j + 1][k] + A[i][k])
           // three lines missing

The table A holds the arc-specific scores. The tables T1, ..., T4 hold values from the set R ∪ {−∞} and correspond to the four different types of subproblems:

   [Figure: schematic pictures of the four subproblem types (type 1 to type 4), each drawn over a span from word i to word k.]

These tables are initialised with the value −∞.

a) Complete the missing lines.

b) State upper bounds for the memory requirement and the runtime of the Eisner algorithm relative to the sentence length n. Explain your answer.

c) Identify properties that distinguish the Eisner algorithm from the algorithm for transition-based dependency parsing. Provide your answer as a table:

   Eisner algorithm              Transition-based parsing
   characteristic property 1     corresponding property 1
   characteristic property 2     corresponding property 2

Grading: Subtract points if the answer to any of the following is "no":

a) The three missing lines:

   T4[i][k] = max_{i ≤ j < k} (T2[i][j] + T1[j + 1][k] + A[k][i])
   T1[i][k] = max_{i ≤ j < k} (T1[i][j] + T4[j][k])
   T2[i][k] = max_{i < j ≤ k} (T3[i][j] + T2[j][k])

   Give 1 point if the solution is mostly but not completely correct, e.g., if the bounds on j are not correct.

b) Memory requirement: O(n²); four charts, each of size O(n²). Runtime requirement: O(n³); three nested for-loops (one hidden in the max operation), each over O(n) possible values.

c) Characteristic properties (and their corresponding properties for transition-based parsing) include: has polynomial-time runtime complexity (linear time); performs an exhaustive search over all projective dependency trees (greedy search); uses an arc-factored feature model (features defined over words in the buffer, words on the stack, the partial dependency tree); solves a combinatorial optimisation problem (solves a sequence of classification problems).
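
For reference, a direct Python transcription of the completed pseudocode (the recurrence from item 07 plus the three lines given above) could look as follows. The encoding of A as a score matrix with −∞ for impossible arcs, and reading the final score off T2[1][n] with word 1 acting as an artificial root, are assumptions made for this illustration.

   NEG_INF = float("-inf")

   def eisner(A):
       """Eisner algorithm for the recurrences in item 07. A[h][d] is the score
       of an arc h -> d; words are indexed 1..n and row/column 0 is unused."""
       n = len(A) - 1

       def new_table():
           return [[NEG_INF] * (n + 1) for _ in range(n + 1)]

       T1, T2, T3, T4 = new_table(), new_table(), new_table(), new_table()
       for i in range(1, n + 1):
           T1[i][i] = 0.0
           T2[i][i] = 0.0
       for k in range(2, n + 1):
           for i in range(k - 1, 0, -1):
               # Incomplete spans (types 3 and 4): join two complete spans and add an arc.
               best_split = max(T2[i][j] + T1[j + 1][k] for j in range(i, k))
               T3[i][k] = best_split + A[i][k]      # arc i -> k
               T4[i][k] = best_split + A[k][i]      # arc k -> i
               # Complete spans (types 1 and 2): join a complete and an incomplete span.
               T1[i][k] = max(T1[i][j] + T4[j][k] for j in range(i, k))
               T2[i][k] = max(T3[i][j] + T2[j][k] for j in range(i + 1, k + 1))
       return T1, T2, T3, T4

   # Toy 3-word example; arcs not listed below are disallowed (score -inf).
   A = [[NEG_INF] * 4 for _ in range(4)]
   A[1][2], A[1][3], A[2][3], A[3][2] = 1.0, 1.0, 2.0, 2.0
   T1, T2, T3, T4 = eisner(A)
   print(T2[1][3])   # 3.0: best tree headed at word 1, e.g. via the arcs 1 -> 2 and 2 -> 3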

08 Named entity tagging (6 points)

Named entity tagging is the task of identifying entities such as persons, organisations, and locations in running text. One idea to approach this task is to use the same techniques as in part-of-speech tagging. However, a complicating factor is that named entities can span more than one word. Consider the following sentence:

   Thomas Edison was an inventor from the United States.

In this example we would like to identify the bigram Thomas Edison as a mention of one named entity of type person (PER), not two words; similarly, we would like to identify United States as one named entity of type location (LOC). To solve this problem, we can use the so-called BIO tagging. In this scheme we introduce a special part-of-speech tag for the beginning (B) and inside (I) of each entity type, as well as one tag for words outside (O) any entity. Here is the example sentence represented with BIO tags:

   Thomas  Edison  was  an  inventor  from  the  United  States
   B-PER   I-PER   O    O   O         O     O    B-LOC   I-LOC

a) Named entity taggers using the BIO tagging scheme can be trained using supervised machine learning, in much the same way as part-of-speech taggers can. Explain what type of gold-standard data is required for this.

b) Discuss which type of gold-standard data is easier to obtain more of: data for part-of-speech taggers or data for named entity taggers.

c) Named entity taggers using the BIO tagging scheme can be evaluated at the level of words ("How many words were tagged with the correct BIO tag?") or at the level of entities ("How many n-grams were tagged with their correct entity type?"). Provide a concrete example, similar to the one above, which shows that a system can have a high word-level tagging accuracy but a low entity-level tagging accuracy. Your example should contain at least one named entity of type person (PER). Explain.

Grading:

a) Sentences where words are manually annotated with BIO tags.

b) One valid argument gives 1 point; for 2 points, you need to provide several valid arguments, or a nuanced discussion (on the one hand / on the other hand), or a clear conclusion that is grounded in the arguments. Valid arguments include: Gathering data for named entity recognition (NER) is easier because it needs less expert knowledge; easier because we can use existing resources like gazetteers and Wikipedia; harder because there are fewer instances of entities per word; harder because there is an ever-growing list of names, while the number of words in part-of-speech tagging grows less quickly; harder because of anaphora; simpler if there are fewer classes of entities (compared to the number of classes in part-of-speech tagging); simpler because there is less ambiguity.

c) Full points for a complete example including gold-standard BIO tags, predicted BIO tags, and a calculation of word-level and entity-level accuracy. It is important that there is one B and one I tag per entity type.
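
To illustrate the word-level versus entity-level evaluation from item (c), here is a small Python sketch. The predicted tags in the example are invented for illustration; only the evaluation logic reflects the item.

   def extract_entities(tags):
       """Collect (start, end, type) spans from a BIO tag sequence."""
       entities, start, etype = [], None, None
       for i, tag in enumerate(list(tags) + ["O"]):   # sentinel flushes the last span
           inside_same = tag.startswith("I-") and tag[2:] == etype
           if start is not None and not inside_same:
               entities.append((start, i, etype))
               start, etype = None, None
           if tag.startswith("B-"):
               start, etype = i, tag[2:]
       return set(entities)

   def word_accuracy(gold, pred):
       return sum(g == p for g, p in zip(gold, pred)) / len(gold)

   def entity_accuracy(gold, pred):
       """Fraction of gold entities predicted with exactly the right span and type."""
       gold_entities = extract_entities(gold)
       return len(gold_entities & extract_entities(pred)) / len(gold_entities)

   # Invented example: the tagger gets almost every word right but breaks the PER entity.
   gold = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-LOC", "I-LOC"]
   pred = ["B-PER", "B-PER", "O", "O", "O", "O", "O", "B-LOC", "I-LOC"]
   print(word_accuracy(gold, pred))     # 8/9, high word-level accuracy
   print(entity_accuracy(gold, pred))   # 1/2, low entity-level accuracy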

Part C

09 Variations on the Eisner algorithm (12 points)

Explain your solutions in detail!

a) Suppose that you have a list of allowed dependencies of the form w → w′, where w and w′ are words. How should you construct the arc matrix A to let the Eisner algorithm decide whether there exists a tree with allowed arcs only? (You can construct the arc matrix differently for each sentence.)

b) To compute the score for an item of type 3, the Eisner algorithm solves the following optimisation problem (see item 07):

   T3[i][k] = max_{i ≤ j < k} (T2[i][j] + T1[j + 1][k] + A[i][k])

This line contains two operations: The outer operation is max, which maximises a term over all possible choices of a word position j. The inner operation is +, which adds the scores of two simpler subproblems and the score of an arc. Suppose that you want to find the total number of possible trees for an input sentence x. How could you solve this problem by using different mathematical operations for the inner and the outer operation of the rules, as well as appropriately constructing the arc matrix A?

c) The Eisner algorithm can be simplified by replacing subproblems of type 3 and subproblems of type 4 with one new type of subproblem, which we may refer to as type 5. To compute the score of subproblems of this type, we combine two triangles without charging the score for an arc:

   T5[i][k] = max_{i ≤ j < k} (T2[i][j] + T1[j + 1][k])

How do you have to modify the other rules of the Eisner algorithm such that the modified algorithm becomes equivalent to the original one?