Semantics Isn't Easy: Thoughts on the Way Forward


Nancy Ide, Vassar College
Rebecca Passonneau, Columbia University
Collin Baker, ICSI/UC Berkeley
Christiane Fellbaum, Princeton University

New York University, November 14-15, 2008

Approaches to Semantic Knowledge Acquisition
- Semi-/unsupervised statistical methods to gather facts from unannotated corpora
- Smaller, (semi-)manual, detailed analyses using annotated resources
- Both approaches are essential, and are best leveraged if performed in close coordination

Semi-/unsupervised learning: Advantages
- Few resources required (inexpensive)
  - Although may implicitly use throw-away annotations
  - Implicitly reliant on a lot of knowledge
- Fast
  - Can rapidly amass volumes of information
- Web provides vast amounts of language data
  - Size overcomes noise?
- Reliability is a concern
  - Results skewed by different varieties and genres, different types of speakers
  - E.g. British vs. American English split in the frequency of a phenomenon
  - E.g. inclusion of vast amounts of incorrect language or persistent non-native speaker errors: unreliable information for ESL?

Semi-/unsupervised learning: Limitations
- Assumes a (single) set of semantic facts (meanings, relations, etc.) that
  - is stable and discoverable
  - humans can agree on (90% of the time?)
- Web data unreliable
  - Distributional features that are fundamentally unknown
  - Left with a linguistic black box
  - Variations due to genre, dialects, situation, context, when produced, etc. are conflated
- Cannot capture fluid, dynamic, generative aspects of word and phrasal meaning
  - We have not explored means to represent and process language that take this into account

Analyses relying on annotated resources: Advantages
- Overcomes disadvantages of unsupervised / unannotated approaches
  - Can get at the more fluid and dynamic aspects of language
  - Can examine impact of genre, situation, context, dialect, etc. by controlling corpus content
- Heavily annotated data enables exploration of interrelations among linguistic layers
  - A critical next step for NLP research

Analyses relying on annotated resources: Disadvantages
- Expensive!
  - Costly to manually or even semi-manually produce reliable language data and annotations
- Slow
  - Manual work takes time

MASC: Manually Annotated Sub-Corpus
- NSF-funded project to provide a sharable, reusable annotated resource with rich linguistic annotations
- Texts from a wide range of genres
- Manual annotations or manually validated annotations for multiple levels
  - WordNet senses
  - FrameNet frames and frame elements
  - Shallow parses
  - Named entities
- Enables linking WordNet senses and FrameNet frames into more complex semantic structures
- Enriches semantic and pragmatic information
- Detailed inter-annotator agreement measures

Contents
- Texts drawn from
  - Open ANC
  - Freely distributable portions of LU Corpus
  - Subset of Wall Street Journal texts that have been heavily annotated by multiple projects
- Several genres
  - Written (travel guides, blog, fiction, letters, newspaper, non-fiction, technical, journal, government documents, court transcript)
  - Spoken (face-to-face, academic, telephone)
- Free of license restrictions, redistributable
- All MASC data and annotations freely downloadable from the ANC website

Annotation Process
- Smaller portions of the sub-corpus manually annotated for specific phenomena
  - Maintain representativeness
  - Include as many annotations of different types as possible
- Apply (semi-)automatic annotation techniques to determine the reliability of their results
- Study inter-annotator agreement on manually produced annotations
  - Determine benchmark of accuracy
  - Fine-tune annotator guidelines
- Consider whether accurate annotations for one phenomenon can improve performance of automatic annotation systems for another
  - E.g., validated WN sense tags and noun chunks may improve automatic semantic role labeling

Process (continued)
- Apply iterative process to maximize performance of automatic taggers
  - Manual annotation
  - Retrain automatic annotation software
- Improved annotation software later applied to the entire OANC
  - Provide more accurate automatically produced annotation of full corpus
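The agreement study mentioned above can be illustrated with a small sketch. The slides do not say which agreement measure MASC uses; Cohen's kappa is swapped in here only because it is a common choice for two annotators. The function name and the sense labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sense tags assigned to four occurrences of one word.
annotator_1 = ["sense_1", "sense_2", "sense_1", "sense_1"]
annotator_2 = ["sense_1", "sense_2", "sense_2", "sense_1"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 3 of 4 items (0.75) against a chance agreement of 0.5, so kappa is 0.5. A value like this would typically prompt another round of guideline tuning before the data are used as a benchmark.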

Representation
- ISO TC37 SC4 Linguistic Annotation Framework
  - Graph of feature structures (GrAF)
  - Isomorphic to other feature structure-based representations (e.g. UIMA CAS)
- Each annotation in a separate stand-off document linked to primary data or other annotations
- Merge annotations with ANC API
- Output in any of several formats
  - XML
  - Non-XML, for use with systems such as NLTK and concordancing tools
  - UIMA CAS
  - Input to GraphViz

ANC Pipeline
- Texts in different formats go into ANC processing, which automatically annotates them
- ANC processing produces primary data plus annotations as a graph of feature structures in stand-off XML documents
- The ANC Tool merges some or all annotations
- Output: input to UIMA, input to GraphViz, input to NLTK, others...
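The stand-off idea described above can be sketched in a few lines: the primary text is never modified, each layer lives in its own list, and annotations point into the text by character offset. This is a simplified illustration only, not the GrAF schema or the ANC API; the class, function, layer names, and feature values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    layer: str      # illustrative layer name, e.g. "wordnet"
    start: int      # character offset into the primary text (inclusive)
    end: int        # character offset into the primary text (exclusive)
    features: dict = field(default_factory=dict)


def annotate(text, layer, phrase, **features):
    """Build an annotation over the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Annotation(layer, start, start + len(phrase), features)


def merge_layers(*layers):
    """Merge separately stored layers into one list ordered by position."""
    merged = [ann for layer in layers for ann in layer]
    # Earlier spans first; at equal starts, longer (enclosing) spans first.
    return sorted(merged, key=lambda a: (a.start, -a.end, a.layer))


primary = "She deposited cash at the bank."

# Each layer is kept separately, as a stand-off document would be.
wordnet_layer = [annotate(primary, "wordnet", "bank", sense="hypothetical-sense-id")]
framenet_layer = [annotate(primary, "framenet", "deposited", frame="hypothetical-frame")]
chunk_layer = [annotate(primary, "chunk", "the bank", type="NP")]

merged = merge_layers(wordnet_layer, framenet_layer, chunk_layer)
covered = [primary[a.start:a.end] for a in merged]
```

Because every layer refers only to offsets in the unchanged primary text, any subset of layers can be merged or left out without touching the others, which is the property the slides rely on for combining WordNet, FrameNet, and chunk annotations.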

Transduction
- Different annotation formats
  - PTB
  - PropBank
  - NomBank
  - PDTB
  - TimeBank
- Transduce to GrAF
- Merge

Alignment of Lexical Resources
- Concurrent NSF-funded project investigating how and to what extent WordNet and FrameNet can be aligned
- MASC annotations of FrameNet frames and frame elements and WordNet senses provide a ready-made testing ground
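As a rough illustration of what transduction involves, the sketch below converts a toy inline format (`word/TAG` tokens separated by spaces) into primary text plus stand-off spans. The real source formats listed above are considerably more complex, and this is not the project's transducer; the function name and the input format are assumptions made for the example.

```python
def transduce_inline_tags(tagged):
    """Turn 'word/TAG word/TAG ...' into (primary_text, [(start, end, tag), ...])."""
    words, spans, offset = [], [], 0
    for token in tagged.split():
        # Split on the last slash so words containing '/' keep their text.
        word, sep, tag = token.rpartition("/")
        if not sep or not word:
            raise ValueError(f"token without a word/TAG shape: {token!r}")
        spans.append((offset, offset + len(word), tag))
        words.append(word)
        offset += len(word) + 1  # +1 for the single space that joins words
    return " ".join(words), spans


text, spans = transduce_inline_tags("MASC/NNP is/VBZ free/JJ")
```

After this step the tags no longer sit inside the text, so they can be merged with other layers that refer to the same character offsets.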

Goals
- Continually augment MASC with contributed annotations from the research community
  - Discourse structure, additional entities, events, opinions, etc.
- Distribution of effort and integration of currently independent resources such as the ANC, WordNet, and FrameNet will enable progress in resource development
  - Less cost
  - No duplication of effort
  - Greater degree of accuracy and usability
  - Harmonization
- MASC can serve as a model for community effort to develop required methods and resources to further NLP research

MASC
- Will be the largest semantically annotated corpus of English in existence
- Should have a major impact on the speed with which similar resources can be reliably annotated
- WN and FN annotation of the MASC will immediately create a massive multilingual resource network
  - Both WN and FN are linked to corresponding resources in other languages
  - No existing resource approaches this scope
- Because it enables merging annotations at different linguistic levels, it will
  - facilitate a deeper investigation of interactions among linguistic phenomena
  - contribute to better understanding of the workings of language at the semantic level

Recommendations and Conclusions
- Pursue automatic acquisition efforts and manual resource creation, annotation, and analysis in parallel
  - Automatic acquisition can get us to the ~80% ceiling
  - Manual effort can get us the other 20%
- Embrace the need to render the knowledge resources created by automatic acquisition in a form and format that can interoperate with annotations and other resources
  - The NLP community does not need yet another independent resource that is difficult or impossible to use with other resources

PS
- OANC is available
- 1st set of MASC data (~120K words) should be available by end of year, augmented regularly after that
- We encourage contributions of annotations (automatic or manual) of MASC and/or OANC data for any linguistic phenomenon, in any format
  - We will do the transduction to GrAF
