
Natural Language Processing: Programming Project II
PoS tagging

Reykjavik University
School of Computer Science
November 2009

1 Pre-processing (20%)

In this project you experiment with part-of-speech (PoS) tagging using the German Tiger corpus. Go to http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/ and download the corpus by selecting the License link. The file tiger_release_aug07.export is the one you need to pre-process. This file contains various kinds of data (see the link Negra Export Format on the Tiger web page), but in this project we are only interested in <word, PoS tag> pairs (starting from line no. 2356).

1.1 Data extraction

Here you need to generate one file from tiger_release_aug07.export. The file should contain column 1, as well as columns 3 (an STTS (Stuttgart-Tübingen Tagset) tag) and 4 (morphological tags) merged together. Put a dot (.) between the data from columns 3 and 4. Thus, the resulting file contains a token and its morphosyntactic tag. Name this file data.txt. For example, the third and the fourth lines in data.txt should look like this:

Perot NE.Nom.Sg.Masc
wäre VAFIN.3.Sg.Past.Subj

Make sure that there is an empty line between sentences in data.txt. If you do the extraction described above correctly, the file contains 939,052 lines.
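A minimal sketch of this extraction step, using tail and awk (see also Section 1.3 below). The handling of #BOS/#EOS sentence markers and of other #-prefixed non-terminal lines is an assumption based on the Negra export format description and should be checked against the actual file:

  # extract.sh -- sketch of the data extraction (assumptions noted above)
  tail -n +2356 tiger_release_aug07.export |
  awk '
      /^#BOS/ { next }              # sentence start marker: skip
      /^#EOS/ { print ""; next }    # sentence end: emit the empty separator line
      /^#/    { next }              # non-terminal node lines: skip
      { print $1, $3 "." $4 }       # token, then STTS tag and morph tags merged with a dot
  ' > data.txt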

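As a quick sanity check of the extraction, running wc -l data.txt should report the 939,052 lines mentioned above before you move on to the split.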
1.2 Split into train and test corpora

Assuming you have the file data.txt, you now need to generate a <train, test> pair for this file. The last 49,982 lines (starting with "Die Folge ist ...") in the file constitute the test data; everything above them constitutes the training data. The output of this phase therefore consists of two corpora: train.txt and test.txt.

1.3 Use a shell script!

You should write a shell script which uses some Linux text processing utilities like head, tail, grep, sed, and awk to extract all the relevant data described above. An alternative is to use Perl to perform this task. A sketch of the splitting step follows.
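A minimal sketch of the split, assuming data.txt was produced as in Section 1.1; the 49,982-line figure is taken from the description above:

  # split.sh -- sketch of the train/test split
  total=$(wc -l < data.txt)
  test_lines=49982
  head -n $(( total - test_lines )) data.txt > train.txt   # everything above the test part
  tail -n $test_lines data.txt > test.txt                  # the last 49,982 lines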

2 Base tagger (30%)

In this part you develop a base tagger, i.e. a tagger which always selects the most frequent tag for each word (token). Use the following procedure to develop your tagger:

1. Training. For the training corpus train.txt, use the Linux tools sed, uniq and sort to construct the file train.freq, showing the frequency of each <word, tag> pair appearing in train.txt. Here are the first ten entries from train.freq:

41190 , $,.--
38880 . $..--
15341 und KON.--
12988 in APPR.--
8331 $(.--
8329 $(.--
7942 der ART.Dat.Sg.Fem
7771 von APPR.--
6807 die ART.Acc.Sg.Fem
6426 der ART.Gen.Sg.Fem

Write a Perl program, builddictionary.pl, which reads a file in the format described above (i.e. the format of train.freq). The output is a dictionary, train.dict. Each line has the following form:

w wfreq t1 t1freq t2 t2freq ...

Each line shows how often (wfreq) a specific word w appears in the input, followed by <tag, frequency> pairs for the given word, sorted by descending frequency. For example, <t1, t1freq> denotes that w was tagged t1freq times with tag t1 and that t1 was the most frequent tag for w. Note that this dictionary contains more information than your base tagger needs, since the base tagger only uses the most frequent tag, but all of it would be needed if you were developing an HMM tagger.
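A minimal Perl sketch of builddictionary.pl, assuming each train.freq line has the form "<freq> <word> <tag>" separated by whitespace, as in the sample above (the alphabetical output order is a choice, not a requirement):

  #!/usr/bin/perl
  # builddictionary.pl -- sketch; input lines are "<freq> <word> <tag>"
  use strict;
  use warnings;

  my (%wordfreq, %tagfreq);
  while (my $line = <>) {
      chomp $line;
      my ($freq, $word, $tag) = split ' ', $line;
      next unless defined $tag;          # skip lines that do not parse
      $wordfreq{$word} += $freq;
      $tagfreq{$word}{$tag} += $freq;
  }
  # one line per word: w wfreq t1 t1freq t2 t2freq ... (tags by descending frequency)
  for my $word (sort keys %wordfreq) {
      my $tags  = $tagfreq{$word};
      my @pairs = map { "$_ $tags->{$_}" }
                  sort { $tags->{$b} <=> $tags->{$a} } keys %$tags;
      print join(' ', $word, $wordfreq{$word}, @pairs), "\n";
  }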

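Assuming the sketch above, the dictionary would then be produced with something like perl builddictionary.pl train.freq > train.dict.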
2. Tagging new text (testing). Write a Perl program, basetagger.pl, accepting a dictionary (in the format described above) as input, together with tokenised text for testing, i.e. one token per line with an empty line between sentences. I suggest that you supply the file test.txt as the second parameter; even though it contains both the token and its correct tag, the program will simply ignore the tag. The program outputs one line for each word and its most frequent tag. For unknown words: use the tag NN.Nom.Sg.Masc as a default (this tag is the most frequent noun tag in the corpus). Moreover, output <UNKNOWN> at the end of the line to mark the word as unknown. An example of usage:

perl basetagger.pl train.dict test.txt test.out

An example of lines (no. 12-34) from test.out:

$(.--
Man PIS.Nom.Sg.*
hat VAFIN.3.Sg.Pres.Ind
einen ART.Acc.Sg.Masc
Teil NN.Nom.Sg.Masc
der ART.Dat.Sg.Fem
Landsleute NN.Gen.Pl.*
aus APPR.--
dem ART.Dat.Sg.Masc
demokratischen ADJA.Pos.Dat.Sg.Fem
Konsens NN.Acc.Sg.Masc
verloren VVPP.Psp
, $,.--
weil KOUS.--
sie PPER.3.Nom.Sg.Fem
sich PRF.3.Acc.Sg
solcher PIAT.Gen.Pl.Neut
Anbiederung NN.Nom.Sg.Masc <UNKNOWN>
nicht PTKNEG.--
anschließen VVINF.Inf
wollten VMFIN.3.Pl.Past.Ind
. $..--

3 Accuracy (30%)

Write a Perl program, accuracy.pl, accepting an input file in which each line has the following form:

word correct_tag word tag_from_tagger <UNKNOWN>

(<UNKNOWN> only appears if the word is unknown). The input file is thus a combination of a gold standard file and the output from a tagger. The output is written to standard output and shows the accuracy of the tagger for known words, unknown words, and all words. An example of usage:

perl accuracy.pl <combined_file>

Example output:

Number of tokens: 47372
Number of errors: 16384
Overall tagging accuracy: 65.41%
Tagging accuracy for known words: 71.19%
Number of unknown words: 4108
Unknown word ratio: 8.67%
Number of errors for unknown words: 3918
Tagging accuracy for unknown words: 4.63%

For this part, I recommend that you write a script which starts by combining test.txt and test.out into one file and then executes accuracy.pl.
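The combination can be done with, for example, paste -d ' ' test.txt test.out > combined.txt, assuming the two files are line-aligned (including the empty lines between sentences). A minimal Perl sketch of accuracy.pl itself, assuming each non-empty combined line is "word correct_tag word tag_from_tagger" with an optional trailing <UNKNOWN> marker:

  #!/usr/bin/perl
  # accuracy.pl -- sketch; see the input format assumptions above
  use strict;
  use warnings;

  my ($tokens, $errors, $unknown, $unknown_errors) = (0, 0, 0, 0);
  while (my $line = <>) {
      chomp $line;
      next if $line =~ /^\s*$/;                 # skip sentence separator lines
      my @f = split ' ', $line;
      my $is_unknown = $f[-1] eq '<UNKNOWN>';
      my ($gold, $pred) = ($f[1], $f[3]);
      $tokens++;
      $unknown++ if $is_unknown;
      if ($gold ne $pred) {
          $errors++;
          $unknown_errors++ if $is_unknown;
      }
  }
  my $known        = $tokens - $unknown;
  my $known_errors = $errors - $unknown_errors;
  printf "Number of tokens: %d\n", $tokens;
  printf "Number of errors: %d\n", $errors;
  printf "Overall tagging accuracy: %.2f%%\n", 100 * ($tokens - $errors) / $tokens;
  printf "Tagging accuracy for known words: %.2f%%\n", 100 * ($known - $known_errors) / $known;
  printf "Number of unknown words: %d\n", $unknown;
  printf "Unknown word ratio: %.2f%%\n", 100 * $unknown / $tokens;
  printf "Number of errors for unknown words: %d\n", $unknown_errors;
  printf "Tagging accuracy for unknown words: %.2f%%\n",
      $unknown ? 100 * ($unknown - $unknown_errors) / $unknown : 0;

With the files combined as above, perl accuracy.pl combined.txt should print the eight summary lines in the format shown.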

4 Training and testing an HMM tagger (20%)

In this part you need to train TriTagger, the trigram tagger which is part of the IceNLP toolkit, and then use it to tag the test corpus.

Important: IceNLP assumes all files are UTF-8 encoded. Therefore, you have to convert your training and test corpora to UTF-8 (if you have not done so already) before using TriTagger.

1. Download IceNLP from http://www.ru.is/faculty/hrafn/software/IceNLP-1.3.zip and extract it to a directory of your choice.

2. Read the section on TriTagger in the user manual IceNLP.pdf (available in the /doc directory) to become familiar with how to train using a training corpus and how to run the tagger on a test corpus.

3. Make a version of your test corpus which only includes the words, not the PoS tags.

4. Build a training model using the train.txt corpus. Show the command you use for training.

5. Now use TriTagger to tag the test corpus [1] and make it write the output to the file test.tri.out. Show the command you use for tagging (testing).

6. Finally, use your accuracy.pl program to compute the accuracy of TriTagger on the test corpus. Show the output of accuracy.pl.

5 What to hand in

Hand in a report describing each step you took to solve this project. In particular, describe all commands used and all programs and scripts implemented. Moreover, hand in all scripts and programs, along with instructions on how to use them. Also make sure that you discuss the results.

[1] Make sure that you start TriTagger with the -cs flag. Information about the flags is in the user manual, and you can also see the command line flags by typing ./tritagger.sh -help.
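Returning to the "Important" note and step 3 of Section 4: the encoding conversion and the words-only test corpus can be prepared along the following lines. The ISO-8859-1 source encoding is an assumption; check what your corpus files actually use, and take the actual training and tagging commands from the IceNLP user manual:

  # prepare.sh -- sketch of the Section 4 preparation steps
  iconv -f ISO-8859-1 -t UTF-8 train.txt > train.utf8.txt   # IceNLP expects UTF-8
  iconv -f ISO-8859-1 -t UTF-8 test.txt  > test.utf8.txt
  awk '{ print $1 }' test.utf8.txt > test.words.txt         # words only; empty lines are kept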