Corpus Linguistics. Seminar Resources for Computational Linguists, SS 2007. Magdalena Wolska & Michaela Regneri

Armchair Linguists vs. Corpus Linguists
- competence vs. performance

Motivation (for )

Outline
- Corpora
- Annotation
- Data Analysis
- The Web as Corpus

Corpus - definition
- in principle: every collection of text
- (desired or necessary) properties of corpora for linguistic processing:
  - representativeness
  - finite size (mostly)
  - machine-readability
  - standard reference

Corpus - properties
- language mode (speech vs. text)
- languages and alignment: mono-/bilingual, comparable/parallel
- balance: homo-/heterogeneous, balanced/unbalanced
- annotation: plain/annotated, annotation type and depth
- text types (newspapers, novels, phone calls...)
- text domains (business, finance, love stories...)
- date / time span (of texts used)
- size

Annotation - principles
- linguistic information in a corpus
- maxims of annotation (Leech 1993):
  - removable and extractable
  - annotation guidelines available to end user
  - awareness of fallibility (but potential usefulness)
  - scheme should be based on widely-agreed principles which are theory-neutral

Annotation - Data Format
Often variants of XML:

stand-off:

    The dog barks.

    <sentence>
      <phrase type="NP">
        <word ind="1" pos="det"/>
        <word ind="2" pos="N"/>
      </phrase>
      <phrase type="VP">
        <word ind="3" pos="VI"/>
      </phrase>
    </sentence>

inline:

    <sentence>
      <phrase type="NP">
        <word pos="det">the</word>
        <word pos="N">dog</word>
      </phrase>
      <phrase type="VP">
        <word pos="VI">barks</word>
      </phrase>
    </sentence>
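The relation between the two formats can be made concrete with a small conversion. The following is only a sketch using Python's standard `xml.etree.ElementTree`; the function name `inline_to_standoff` is made up for illustration, and the element and attribute names are the ones from the slide example.

```python
import xml.etree.ElementTree as ET

# Inline annotation of "the dog barks", as in the slide example.
INLINE = """
<sentence>
  <phrase type="NP">
    <word pos="det">the</word>
    <word pos="N">dog</word>
  </phrase>
  <phrase type="VP">
    <word pos="VI">barks</word>
  </phrase>
</sentence>
"""

def inline_to_standoff(inline_xml):
    """Return (tokens, stand-off XML string) for an inline-annotated sentence.

    Hypothetical helper: each <word> loses its text and gets a 1-based
    "ind" attribute pointing into the separate token list instead.
    """
    root = ET.fromstring(inline_xml)
    tokens = []
    for word in root.iter("word"):
        tokens.append(word.text)
        word.set("ind", str(len(tokens)))  # index of the token just stored
        word.text = None                   # the text now lives outside the markup
    return tokens, ET.tostring(root, encoding="unicode")

tokens, standoff = inline_to_standoff(INLINE)
print(" ".join(tokens))
print(standoff)
```

The point of the stand-off variant is that the primary text stays untouched, so several annotation layers can refer to the same tokens by index.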

Annotation - examples: Treebanks (syntax)

Annotation - examples: semantic roles (SALSA)

Annotation - examples: discourse structure

Annotation - Tools
- graphical UIs for drawing annotations, laid out like the resulting output
- example: RSTTool

Data Analysis
- word counts (word frequency, tokens per type)
- concordance: same word in different contexts

     La Streisand sounded just like the student activist she played in the film T
     s pilot's wings, he was judged top student. After his weapons training, he w
     with Der Bettelstudent (The Beggar Student) and Gasparone in the fairly rece
     S.LOWRY: THE MAN AND HIS ART: As a student and long-time resident of Salford
        Antony Fleat, a second-year law student at Oxford Brookes University; and t
            oung life. This second-year student at Robert Gordon's university in Ab
       erdeen, having matriculated as a student at Robert Gordon's university. In
    M, 78, from Harrow, an anthropology student at the University of the Third Ag
       he had enough of London as a law student at University College and the Colle

- n-grams: count the frequencies of word combinations of n words (here: German lemma bigrams)

    3868 vergehen Jahr     1184 kommen Jahr
    2385 neu Land          1181 jung Mann
    2378 letzt Jahr        1107 groß Teil
    2296 nah Jahr           997 lang Zeit
    1398 erst Mal           986 nah Woche
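The three analyses named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a corpus tool: the tokenizer is deliberately crude, the helper names (`tokenize`, `tokens_per_type`, `kwic`, `ngram_counts`) are invented here, and the two sample sentences are made up for the demonstration rather than taken from a corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately crude tokenizer."""
    return re.findall(r"[a-zäöüß']+", text.lower())

def tokens_per_type(tokens):
    """Average number of occurrences per distinct word form."""
    return len(tokens) / len(set(tokens))

def kwic(tokens, keyword, width=4):
    """Yield (left context, keyword, right context) for every hit."""
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            yield left, tok, right

def ngram_counts(tokens, n=2):
    """Count all sequences of n adjacent tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Made-up sample text, only for demonstration.
text = ("As a student and long-time resident of Salford he knew the town. "
        "A second-year law student at Oxford met a student at Aberdeen.")
tokens = tokenize(text)

print(Counter(tokens).most_common(3))       # word frequencies
print(tokens_per_type(tokens))              # tokens per type
for left, key, right in kwic(tokens, "student"):
    print(f"{left:>30} {key} {right}")      # keyword aligned in one column
print(ngram_counts(tokens, 2).most_common(3))
```

Right-aligning the left context is what produces the familiar concordance layout with the keyword in a fixed column.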

Data Analysis - Information Access
- pattern matching with query languages like CQP

Query:

    [lemma="dog"] [pos!="\$.*"]* [pos="nn"] within s;

Examples:

    dog for her daughter
    dogs on the street
    dogs and their leashes
    dog with a cruel owner
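To see what the query pattern asks for, it can be emulated over a tagged sentence in plain Python. This is a sketch of the pattern's logic only, not of how CQP works internally: the function name `find_matches`, the (word, lemma, pos) triple format and the tag names in the sample sentences are assumptions, and the sketch reports the shortest match from each start position.

```python
import re

# Tags beginning with "$", i.e. the ones the query excludes from the gap.
PUNCT = re.compile(r"\$.*")

def find_matches(sentence):
    """Emulate: a token with lemma "dog", then any number of tokens whose
    tag does not match PUNCT, then a token tagged "nn", inside one sentence.

    `sentence` is a list of (word, lemma, pos) triples (assumed format).
    """
    matches = []
    for start, (_, lemma, _) in enumerate(sentence):
        if lemma != "dog":
            continue
        for end in range(start + 1, len(sentence)):
            _, _, pos = sentence[end]
            if pos == "nn":
                # Closing noun found: report the span and stop (shortest match).
                matches.append(" ".join(w for w, _, _ in sentence[start:end + 1]))
                break
            if PUNCT.fullmatch(pos):
                break  # a punctuation tag is not allowed in the gap
    return matches

# Hypothetical tagged sentences; the tag names are illustrative.
s1 = [("A", "a", "dt"), ("dog", "dog", "nn"), ("with", "with", "in"),
      ("a", "a", "dt"), ("cruel", "cruel", "jj"), ("owner", "owner", "nn"),
      (".", ".", "$.")]
s2 = [("The", "the", "dt"), ("dog", "dog", "nn"), (",", ",", "$,"),
      ("said", "say", "vbd"), ("Tom", "Tom", "np"), (".", ".", "$.")]

print(find_matches(s1))
print(find_matches(s2))
```

Searching one sentence at a time plays the role of the `within s` restriction: a match can never cross a sentence boundary.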

The web as corpus
- the web is a collection of text, thus it is a corpus
- the largest available corpus: more than 7.2*10^11 words (10 times bigger than the English Gigaword Corpus [Liu and Curran 2006])
- nearly all kinds of text and lots of languages present
- not preprocessed, lots of ungrammatical (and linguistically useless) text
- how to access it?

The web as corpus
- document counts are shown to correlate directly with real frequencies (Keller 2003), so search engines can help - but...
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation...)
  - only estimated counts, often hard to reproduce exactly
- how to access Google? :) (Google API, scripts)
- Alexa: buy (parts of) the web and process it on their machines

The web as corpus - examples
Extracting and filtering web documents to create linguistically annotated corpora (Baroni and Kilgarriff 2006):
- gather documents for different topics (balance!)
- exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
- exclude documents which seem irrelevant for a corpus (too short or too long, word lists,...)
- do this for several languages and make the corpora available
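The filtering steps can be pictured as a small pipeline. The sketch below is not the procedure from the cited paper: the function names, the word-count limits and the word-list heuristic are all assumptions chosen for illustration, and `can_preprocess` stands in for whatever check decides that the available tagger and lemmatizer can handle a document.

```python
def looks_like_word_list(text, max_short_line_ratio=0.8):
    """Heuristic (assumed): mostly lines of one or two words means a word list."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return True
    short = sum(1 for ln in lines if len(ln.split()) <= 2)
    return short / len(lines) > max_short_line_ratio

def keep_document(text, can_preprocess, min_words=100, max_words=50000):
    """Apply the exclusion criteria in turn; the limits are hypothetical."""
    n_words = len(text.split())
    if not (min_words <= n_words <= max_words):
        return False                 # too short or too long
    if looks_like_word_list(text):
        return False                 # irrelevant for a corpus
    return can_preprocess(text)      # tools must be able to handle it

def build_corpus(docs_by_topic, can_preprocess, **limits):
    """Filter each topic separately so the topic balance stays visible."""
    return {
        topic: [d for d in docs if keep_document(d, can_preprocess, **limits)]
        for topic, docs in docs_by_topic.items()
    }

# Toy input; str.isascii is only a stand-in for a real tool check.
docs = {
    "pets": ["The dog barks at the cat every morning", "a\nb\nc\nd\ne\nf", "too short"],
    "travel": ["We took the early train to the coast last summer"],
}
corpus = build_corpus(docs, can_preprocess=str.isascii, min_words=5, max_words=50)
print({topic: len(kept) for topic, kept in corpus.items()})
```

Keeping the result grouped by topic makes it easy to check after filtering whether one topic has lost far more documents than the others.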

The web as corpus - examples
Directly using web counts (instead of corpus counts), e.g. VerbOcean (Chklovski & Pantel 2004, see http://semantics.isi.edu/ocean/ ):
- gather verb pairs which are semantically related but whose relation is unknown (source: DIRT, Lin and Pantel 2001); example pair: love -- marry
- pick a semantic relation (e.g. "happens-before") and design typical patterns for this relation (e.g. "to X and then Y")
- instantiate the patterns ("to love and then marry") and count Google hits (here: 6)
- estimate whether or not the number of hits indicates a significant correlation, then assign the relation (or not)
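The pattern step can be sketched as follows. This is an illustration under strong assumptions, not VerbOcean's method: the slides do not say how significance is estimated, so a plain hit threshold (`min_hits`, an arbitrary value) stands in for it, and `hit_count` is a caller-supplied function rather than a real search engine API. The only count used is the one given on the slide.

```python
def instantiate(patterns, verb_x, verb_y):
    """Fill the X and Y slots of each pattern with the verb pair."""
    return [p.format(X=verb_x, Y=verb_y) for p in patterns]

def assign_relation(relation, patterns, verb_x, verb_y, hit_count, min_hits):
    """Return (relation or None, total hits) for one verb pair.

    Stand-in decision rule (assumption): assign the relation when the
    summed hits over all instantiated patterns reach `min_hits`.
    """
    queries = instantiate(patterns, verb_x, verb_y)
    total = sum(hit_count(q) for q in queries)
    return (relation if total >= min_hits else None), total

# Pattern from the slide, written with format slots.
PATTERNS = ["to {X} and then {Y}"]

# Canned lookup instead of a live search; 6 is the hit count from the slide.
CANNED_HITS = {"to love and then marry": 6}

def canned_hit_count(query):
    return CANNED_HITS.get(query, 0)

print(instantiate(PATTERNS, "love", "marry"))
print(assign_relation("happens-before", PATTERNS, "love", "marry",
                      canned_hit_count, min_hits=5))  # threshold is arbitrary
```

Passing the hit counter in as a function keeps the pattern logic testable without network access and makes it easy to swap in corpus counts instead of web counts.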

References
Thanks to Sabine Schulte im Walde & Magdalena Wolska for some slides.

Literature:
- McEnery & Wilson (1996): Corpus Linguistics. Edinburgh University Press. (See http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/contents.htm)
- Chklovski & Pantel (2004): VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations. In Proceedings of EMNLP-04.
- Keller (2003): Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics 29(3): 459-484.
- Baroni and Kilgarriff (2006): Large linguistically-processed Web Corpora for multiple languages. In Proceedings of EACL-2006.
- Leech (1993): Corpus annotation schemes. Literary and Linguistic Computing 8(4): 275-81.
- Lin and Pantel (2001): DIRT - Discovery of Inference Rules from Text. In Proceedings of KDD-01.
- Liu and Curran (2006): Web Text Corpus for Natural Language Processing. In Proceedings of EACL-2006.

References
Some Corpora:
- Brown: http://khnt.hit.uib.no/icame/manuals/brown/index.htm
- LOB: http://khnt.hit.uib.no/icame/manuals/lob/index.htm
- BNC: http://www.natcorp.ox.ac.uk/ (online search: http://thetis.bl.uk/lookup.html)
- TIGER: http://www.ims.uni-stuttgart.de/projekte/tiger/tigercorpus/
- Penn Treebank: http://www.cis.upenn.edu/~treebank/
- Penn Discourse Treebank: http://www.cis.upenn.edu/~pdtb/
- Prague Dependency Treebank: http://ufal.mff.cuni.cz/pcedt/