COTiG & LINLaP: a Language-Independent NLP Architecture used as Grammar Checker
Francesc Benavent, GLiCom, UPF
NLP Seminar, UPC, November

Index
  1. Introduction
  2. Architecture
  3. Data representation
  4. Modules
  5. Discussion

Introduction
  1.1. Overview
  1.2. Technology
  1.3. Context and needs

Overview
- Development framework:
  - as a surface-oriented linguistic analyzer
  - for robust NLP-enhanced applications
  - in academic and industrial environments
- Language-independent:
  - linguistic information as external resources
  - currently dealing with Catalan texts
  - Spanish resources are underway

Technology
- Programmed in C# and C++
  - OOP (Object Oriented Programming)
  - API (Application Programming Interface)
- Using the .Net/Mono platform
  - Linux
  - Windows
  - Mac OS X

Context and needs
- Engine requirements
  - External data driven (detection rules)
  - Broad span of external GUIs (Web, Word, Firefox, ...)
  - Reusable to cover future (unknown) applications
- Project limitations
  - Short development period (10-12 months)
  - Parallel development (up to 5 programmers)

Architecture
  2.1. LINLaP
  2.2. COTiG
  2.3. Encapsulation
  2.4. Libraries

LINLaP
- Modular:
  - Flexible
  - Customizable
- Linear:
  - Segmentation
  - Dict. lookup
  - PoS tagging
  - Other...
- Progressive enrichment
[Figure: Processing pipeline. TextStrings > BREAKER > TokenList > LABELER > TLPoS[n] > CHOOSER > TLPoS > other > TLinfo, with DICT as a shared resource]

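The linear, progressively enriching design can be illustrated with a small sketch. This is Python written for this transcript, not the project's C#/C++ code: the function names, the token record and the toy dictionary are all assumptions, and each stage is a trivial stand-in for the real module.

```python
from typing import Callable, Dict, List

# Hypothetical token record: each stage adds fields and never removes any.
Token = Dict[str, object]

def breaker(text: str) -> List[Token]:
    """Segmentation: naive whitespace split (stand-in for the real Breaker)."""
    return [{"source": s} for s in text.split()]

def labeler(tokens: List[Token]) -> List[Token]:
    """Dictionary lookup: attach every possible label from a toy dictionary."""
    toy_dict = {"la": ["Determiner", "Pronoun"], "casa": ["Noun", "Verb"]}
    for tok in tokens:
        tok["possible_labels"] = toy_dict.get(str(tok["source"]).lower(), [])
    return tokens

def chooser(tokens: List[Token]) -> List[Token]:
    """Disambiguation: pick the first candidate (stand-in for a real tagger)."""
    for tok in tokens:
        labels = tok["possible_labels"]
        tok["best_label"] = labels[0] if labels else None
    return tokens

def run_pipeline(text: str, stages: List[Callable]) -> List[Token]:
    """Run the stages in order; the first takes text, the rest take tokens."""
    data = text
    for stage in stages:
        data = stage(data)
    return data

tokens = run_pipeline("La casa", [breaker, labeler, chooser])
```

Because every stage only adds information to the same token list, a stage can be swapped or a new one appended without touching the others.
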
COTiG
[Figure: the Processing pipeline (TextStrings, BREAKER, TokenList, LABELER, CHOOSER, TLPoS, other, TLinfo, DICT) followed by a Detection stage with three modules: TYPO, SPELL and GRAM]
- TYPO: orthotypographic errors
- SPELL: orthographic errors
- GRAM: grammatical errors

Encapsulation
- Inner APIs:
  - Independent development
  - Internal testing
  - External evaluation
  - Easy integration
  - Parallel development (dummy modules)
- Extreme modularity:
  - Inclusion, combination and extension of modules
  - Freedom of implementation and technique (e.g. handcrafted rules vs. ML-induced models)

Libraries
[Figure: library diagram. Labels: CotigToken.dll, CotigTypo.dll, CotigLabel.dll, CotigSpell.dll, CotigChoose.dll, CotigGram.dll, CotigDict.dll, CotigShared.dll, Cotig Main Cat dll, ICotigMain, ICotig, Graphic User Interfaces]

Data representation
  3.1. Blocks
  3.2. Tokens
  3.3. Labels
  3.4. Errors

Blocks
  Property            Type      Description
  Text                <string>  Source text
  BlockType           <enum>    8: Any / Controls / Blanks / NoText / Paragraph / Title / List / Oth.
  IsProcessable       <bool>    If this block must be linguistically processed
  SpanFromChar        <int>     Starting position in the original document
  SpanToChar          <int>     Ending position in the original document
  LengthInCharacters  <int>     Length of the block
- Structural unit (block of text)
- A document is a list of Blocks

Tokens
  Property        Type       Description
  SpanFromChar    <int>      Starting position in the original block
  SpanToChar      <int>      Ending position in the original block
  Source          <string>   Source text
  Normalized      <string>   Normalized text
  TokenType       <enum>     26*: Any / Tag / Break / Punct / Symbl / Num / Word / Complex /...
  TokenShape      <enum>     20*: None / Lower / Upper / Title / Plain / FinalDot / StartApos /...
  PossibleLabels  <label[]>  Once labelled, the list of possible readings
  BestLabel       <label>    Once disambiguated, the most probable label
  IsComplex       <bool>     If this token has been created by merging simple tokens
- Lexical unit (string of chars)
- A Block is a list of Tokens

Labels
  Property        Type      Description
  Form            <string>  Form
  Lemma           <string>  Lemma
  Category        <enum>    44*: Any / Determiner / Adjective / Noun / Pronoun / Verb / Adverb / Conjunction / Interjection / Preposition / Data / Punctuation /...
  Features        <enum>    24*: Gender / Number / Person / Case / Pronoun / Tense / Proper /...
  Dialect         <enum>    8*: Variant / Style
  CanonicalShape  <enum>    20: None / Lower / Upper / Title / Plain / FinalDot / StartDash /...
  Frequency       <float>   Relative frequency of this lemma in a standard corpus
- One morphosyntactic reading
- Used in: tagging and dictionary

Errors
  Property       Type         Description
  SpanFromToken  <int>        Starting position in the current TokenList
  SpanToToken    <int>        Ending position in the current TokenList
  Code           <enum>       Classification of the error: Typo, Spell, Gram; Missing, Wrong, Added; Sign, Form, Phrase
  Message        <string>     Description text displayed to the final user
  Rule           <string>     Description of the fired rule (for debugging purposes)
  Detector       <enum>       Module that found the error and included it in the ErrorList
  Corrections    <TokList[]>  List of suggested multitoken replacements
- One specific detected error (may include suggestions)
- Created by detection modules

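The four record types nest: a document is a list of Blocks, a Block is a list of Tokens, each Token carries Labels, and Errors point at token spans. A minimal sketch of that nesting follows, in Python rather than the project's C#/C++. The snake_case names are renderings of the slide properties, only a subset of properties is kept, and the helper `error_char_span` is a hypothetical addition that simply adds the offsets as given, without assuming whether end positions are inclusive.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Label:
    form: str
    lemma: str
    category: str              # an enum in the slides; a plain string here
    frequency: float = 0.0

@dataclass
class Token:
    span_from_char: int        # offsets are relative to the enclosing block
    span_to_char: int
    source: str
    possible_labels: List[Label] = field(default_factory=list)
    best_label: Optional[Label] = None

@dataclass
class Block:
    text: str
    span_from_char: int        # offset relative to the whole document
    is_processable: bool = True
    tokens: List[Token] = field(default_factory=list)

@dataclass
class Error:
    span_from_token: int       # indices into the block's token list
    span_to_token: int
    code: str
    message: str
    corrections: List[List[str]] = field(default_factory=list)

def error_char_span(block: Block, err: Error) -> Tuple[int, int]:
    """Map a token-level error span to character offsets in the document."""
    first = block.tokens[err.span_from_token]
    last = block.tokens[err.span_to_token]
    return (block.span_from_char + first.span_from_char,
            block.span_from_char + last.span_to_char)

block = Block(text="la cases", span_from_char=100,
              tokens=[Token(0, 2, "la"), Token(3, 8, "cases")])
err = Error(0, 1, "Gram", "Number disagreement", corrections=[["les", "cases"]])
span = error_char_span(block, err)
```

A mapping like this is what a GUI would need in order to underline an error in the original document, since errors are stored against token indices while blocks and tokens store character positions.
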
Modules
  4.1. Dictionary
  4.2. Breaker (P)
  4.3. Labeler (P)
  4.4. Chooser (P)
  4.5. Typo (D)
  4.6. Speller (D)
  4.7. Grammar (D)
P = Processing module, D = Detection module

Dictionary
- Resource module:
  - Shared by any module that needs lexical information
  - Contains lexical entries with relevant information (morphosyntactic, typographic, stylistic, dialect-related)
- Allows direct and inverse searches:
  - Direct: Form -> DICT -> Labels[]
  - Inverse: Label -> DICT -> Forms[]

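Direct and inverse search can be served by two indexes over the same entries. The sketch below is an assumption about one way to do it in Python. The class name, the simplified `(lemma, category)` label and the Catalan toy entries are illustrative and not taken from the project.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Label = Tuple[str, str]  # (lemma, category): a simplified, hypothetical label

class Dictionary:
    def __init__(self) -> None:
        self._by_form: Dict[str, List[Label]] = defaultdict(list)
        self._by_label: Dict[Label, List[str]] = defaultdict(list)

    def add(self, form: str, label: Label) -> None:
        """Register one (form, label) entry in both indexes."""
        self._by_form[form].append(label)
        self._by_label[label].append(form)

    def labels(self, form: str) -> List[Label]:
        """Direct search: Form -> Labels[]."""
        return list(self._by_form.get(form, []))

    def forms(self, label: Label) -> List[str]:
        """Inverse search: Label -> Forms[]."""
        return list(self._by_label.get(label, []))

d = Dictionary()
d.add("cases", ("casa", "Noun"))
d.add("casa", ("casa", "Noun"))
d.add("casa", ("casar", "Verb"))
```

Here `d.labels("casa")` returns both readings of the ambiguous form, and `d.forms(("casa", "Noun"))` returns every form stored for that label, which is the kind of lookup a suggestion generator would need.
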
Breaker (P)
Example text (Catalan):
  El mètode API_Get_Money_(_date_) retorna 14.95_$_. L'_exemple té dues oracions_. Aquesta és la segona_.
  (English: "The method API_Get_Money(date) returns 14.95 $. The example has two sentences. This is the second one.")
- Segmentation levels:
  - Blocks
    - HTML (tags)
    - Raw text (heuristics)
  - Tokens
    - merging utokens (simple tokens)
    - by following contextual rules (grammar)
  - Sentences (ambiguity -> conservative strategy)
    - inserting limit markers
    - by following contextual rules (patterns)

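Token and sentence segmentation as described above can be sketched in three small steps: split the text into simple tokens, merge some of them with a contextual rule, and mark sentence ends only when the context is unambiguous. The Python below is an illustration under assumptions. The regular expression, the single decimal-number merge rule and the "whitespace plus uppercase" sentence test are invented for the example and use a simplified version of the slide's sentence.

```python
import re
from typing import List, Tuple

Tok = Tuple[str, int, int]  # (text, start offset, end offset)

def micro_tokens(text: str) -> List[Tok]:
    """Split into runs of letters, runs of digits, or single other characters."""
    pattern = r"[^\W\d_]+|\d+|\S"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

def merge_decimals(toks: List[Tok]) -> List[Tok]:
    """Contextual rule: digits + '.' or ',' + digits, with no gaps, become one token."""
    out, i = [], 0
    while i < len(toks):
        if (i + 2 < len(toks)
                and toks[i][0].isdigit() and toks[i + 1][0] in (".", ",")
                and toks[i + 2][0].isdigit()
                and toks[i][2] == toks[i + 1][1]
                and toks[i + 1][2] == toks[i + 2][1]):
            merged = toks[i][0] + toks[i + 1][0] + toks[i + 2][0]
            out.append((merged, toks[i][1], toks[i + 2][2]))
            i += 3
        else:
            out.append(toks[i])
            i += 1
    return out

def sentence_ends(toks: List[Tok]) -> List[int]:
    """Conservative rule: a mark ends a sentence only at the end of the text
    or when followed by whitespace and an uppercase-initial token."""
    ends = []
    for i, (text, _, end) in enumerate(toks):
        if text not in (".", "!", "?"):
            continue
        if i + 1 == len(toks):
            ends.append(i)
        else:
            nxt, nxt_start, _ = toks[i + 1]
            if nxt_start > end and nxt[0].isupper():
                ends.append(i)
    return ends

text = "El mètode retorna 14.95$. L'exemple té dues oracions. Aquesta és la segona."
toks = merge_decimals(micro_tokens(text))
ends = sentence_ends(toks)
```

The full stop inside `14.95` never reaches the sentence rule because the merge step has already absorbed it into a number token, which shows why the levels are applied in order.
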
Labeler (P)
- Input: a list of tokens (PossibleLabels[Ø])
- Output: an enriched list of tokens (PossibleLabels[1..n])
- Depending on token type:
  - Lexical word: dictionary lookup
  - Non-lexical: pre-determined mapping
- Implementation
  - As a word form list
  - Indexed by two hash tables

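The dispatch on token type can be sketched as two lookups. In this Python illustration the two tables, the token type names passed in and the choice of "Data" as the category for numbers are assumptions made for the example, although the category names themselves appear in the Labels table.

```python
from typing import Dict, List

# Hypothetical toy resources; the slides say such data lives in external resources.
LEXICON: Dict[str, List[str]] = {
    "la": ["Determiner", "Pronoun"],
    "casa": ["Noun", "Verb"],
}
NON_LEXICAL: Dict[str, List[str]] = {
    "Punct": ["Punctuation"],
    "Num": ["Data"],
}

def label_token(source: str, token_type: str) -> List[str]:
    """Return the possible labels for one token."""
    if token_type == "Word":
        # Lexical word: dictionary lookup on the lowercased form.
        return list(LEXICON.get(source.lower(), []))
    # Non-lexical token: the label depends only on the token type.
    return list(NON_LEXICAL.get(token_type, []))

labels_word = label_token("Casa", "Word")
labels_punct = label_token(".", "Punct")
labels_unknown = label_token("xyzzy", "Word")
```

An unknown word comes back with an empty list, which is the condition the Spell module later treats as a possible orthographic error.
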
Chooser (P)
- Input: a list of tokens (BestLabel = Ø)
- Output: an enriched list of tokens (BestLabel = label_m)
- Disambiguation task:
  - Choosing the most probable label from the candidates
- Implementation
  - As a standard stochastic tagger (HMM / trigram)
  - Unknown words: add-one smoothing
  - Unknown trigrams: backoff (bigram and unigram)

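The two estimation tricks named above can be shown on a toy corpus. The slides do not say which backoff scheme is used, so this Python sketch assumes the simplest one: use the trigram relative frequency if the trigram was seen, otherwise the bigram, otherwise the unigram, with no discounting (so the result is a score rather than a normalized probability). The corpus, tag names and function names are invented.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

def train(tagged: List[List[Tuple[str, str]]]) -> Tuple[Dict[str, Counter], Set[str]]:
    """Collect tag n-gram counts and word emission counts from tagged sentences."""
    c = {name: Counter() for name in ("uni", "bi", "tri", "bi_ctx", "tri_ctx", "emit")}
    vocab: Set[str] = set()
    for sent in tagged:
        tags = ["<s>", "<s>"] + [tag for _, tag in sent]
        for word, tag in sent:
            c["emit"][(tag, word)] += 1
            vocab.add(word)
        for i in range(2, len(tags)):
            t1, t2, t3 = tags[i - 2], tags[i - 1], tags[i]
            c["uni"][t3] += 1
            c["bi"][(t2, t3)] += 1
            c["bi_ctx"][t2] += 1
            c["tri"][(t1, t2, t3)] += 1
            c["tri_ctx"][(t1, t2)] += 1
    return c, vocab

def emission(word: str, tag: str, c: Dict[str, Counter], vocab: Set[str]) -> float:
    """Add-one smoothed P(word | tag); one extra slot stands for unknown words."""
    return (c["emit"][(tag, word)] + 1) / (c["uni"][tag] + len(vocab) + 1)

def transition(t1: str, t2: str, t3: str, c: Dict[str, Counter]) -> float:
    """Backoff score for t3 after (t1, t2): trigram, else bigram, else unigram."""
    if c["tri"][(t1, t2, t3)]:
        return c["tri"][(t1, t2, t3)] / c["tri_ctx"][(t1, t2)]
    if c["bi"][(t2, t3)]:
        return c["bi"][(t2, t3)] / c["bi_ctx"][t2]
    return c["uni"][t3] / sum(c["uni"].values())

corpus = [[("la", "Det"), ("casa", "Noun")], [("la", "Det"), ("porta", "Noun")]]
counts, vocab = train(corpus)
```

With this corpus an unseen word such as "gos" still gets a non-zero emission score for every tag, and an unseen tag trigram falls back to shorter histories instead of scoring zero. A full tagger would combine these scores with Viterbi search, which is omitted here.
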
Typo (D)
- Input: a list of unlabelled tokens
- Output: a list of orthotypographic errors
- Detection task:
  - Find contextual orthotypographic errors (case, spaces, punctuation...)
- Implementation
  - Detection: hard-coded patterns encode rules for common mistakes
  - Suggestions: hard-coded patterns encode the re-generation of the token list

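A hard-coded pattern of this kind pairs a detection condition over neighbouring tokens with a recipe for the replacement tokens. The Python sketch below encodes one made-up rule (a blank before a punctuation mark). The record type, the rule and its message are assumptions, and whitespace is assumed to be kept as its own token.

```python
from typing import List, NamedTuple

class TypoError(NamedTuple):
    span_from_token: int
    span_to_token: int
    message: str
    correction: List[str]   # replacement tokens for the flagged span

def find_space_before_punct(tokens: List[str]) -> List[TypoError]:
    """Flag a whitespace token directly followed by , . ; or :"""
    errors = []
    for i in range(len(tokens) - 1):
        if tokens[i].isspace() and tokens[i + 1] in {",", ".", ";", ":"}:
            errors.append(TypoError(
                i, i + 1,
                "Remove the space before the punctuation mark",
                [tokens[i + 1]],   # suggestion: keep the mark, drop the blank
            ))
    return errors

tokens = ["Hola", " ", ",", " ", "món"]
errors = find_space_before_punct(tokens)
```

The rule needs no labels, only token text and position, which matches the module's input of unlabelled tokens.
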
Spell (D)
- Input: a list of labelled tokens (PossibleLabels = Ø or label[])
- Output: a list of orthographic errors
- Detection task:
  - Find non-contextual orthographic errors
- Implementation
  - Detection: words not found in the dictionary
  - Suggestion: inspired by NetSpell (distance-based)

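Distance-based suggestion can be illustrated with plain Levenshtein edit distance. This is not NetSpell's algorithm and not the project's code. It is a generic Python sketch, and the function names, thresholds and toy lexicon are assumptions.

```python
from typing import Iterable, List

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def suggest(word: str, lexicon: Iterable[str],
            max_distance: int = 2, limit: int = 5) -> List[str]:
    """Return lexicon words within max_distance, closest first."""
    scored = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for dist, w in scored if dist <= max_distance][:limit]

suggestions = suggest("cassa", {"casa", "cosa", "porta"})
```

Scanning the whole lexicon this way is linear in its size for every misspelled word, so a dictionary of the size mentioned in the discussion would likely need candidate filtering first.
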
Grammar (D)
- Input: a list of disambiguated tokens (BestLabel = label_m)
- Output: a list of grammatical errors
- Detection task:
  - Find contextual grammatical errors (agreement, ...)
- Implementation
  - Contextual rules declared manually, following a formalism defined for the project

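The project's rule formalism is not shown in the slides, so the sketch below only illustrates what a contextual agreement rule checks, with the rule hard-coded in Python. The token fields, the rule and the messages are all assumptions.

```python
from typing import Dict, List, Tuple

Token = Dict[str, str]  # hypothetical fields: source, category, gender, number

def check_det_noun_agreement(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Flag a Determiner directly followed by a Noun that differs in gender or number."""
    errors = []
    for i in range(len(tokens) - 1):
        det, noun = tokens[i], tokens[i + 1]
        if det["category"] == "Determiner" and noun["category"] == "Noun":
            for feature in ("gender", "number"):
                if det[feature] != noun[feature]:
                    errors.append((i, i + 1, f"{feature} disagreement"))
    return errors

phrase = [
    {"source": "la", "category": "Determiner", "gender": "f", "number": "sg"},
    {"source": "cases", "category": "Noun", "gender": "f", "number": "pl"},
]
errors = check_det_noun_agreement(phrase)
```

The rule reads only the single chosen label of each token, which is why this module runs after the Chooser rather than directly after the Labeler.
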
Discussion
  5.1. On the architecture
  5.2. On the modules
  5.3. Future improvements
  5.4. Conclusions

On the architecture
- Advantages:
  - Strict division of levels -> easier task sharing
  - Strict separation of modules -> easier development
  - Encapsulation -> flexibility and smooth integration
  - Object based -> robustness and efficiency
  - Blackboard inspired -> all modules see all the information
- Limitations:
  - One-to-one mapping between Tokens <-> Labels (issues with contractions and chunk labelling)

On the modules
- Breaker: level segmentation
  - (+) expressivity of token patterns
  - (-) hard-coded patterns (external files are underway)
- Labeler (dictionary based)
  - (+) high-speed tagging (linear time)
  - (-) high memory usage (700K words = 100 MB)
- Chooser
  - (+) precision and speed
  - (-) low granularity (currently only 8 PoS)

Future improvements
- On the modules:
  - Breaker: tokenization rules from external files
  - Chooser: increasing the granularity of PoS
- On the architecture:
  - Replacing the TokenList with a TokenChart (in order to hold multi-token entities)
  - Adding support for generic n-level annotations

Conclusions (I)
LINLaP is an expandable, multiplatform, language-independent architecture and development framework for robust NLP applications. Its architecture makes it easy to combine different modules and techniques, and to extend the analysis to higher levels of linguistic description.

Conclusions (II)
The current version handles Catalan texts (Spanish is underway) and is limited to morphosyntactic tagging. It has been adapted to context-sensitive error correction, and it will be used as part of an information extraction system and an automatic summarization system.

Questions and suggestions
Overview Background of project The task The system Digression: gently machine aided ontology construction Evaluation Future Work -Guided Information Extraction from Pathology Reports The SWPatho Project
More informationContent Based Key-Word Recommender
Content Based Key-Word Recommender Mona Amarnani Student, Computer Science and Engg. Shri Ramdeobaba College of Engineering and Management (SRCOEM), Nagpur, India Dr. C. S. Warnekar Former Principal,Cummins
More informationActivity Report at SYSTRAN S.A.
Activity Report at SYSTRAN S.A. Pierre Senellart September 2003 September 2004 1 Introduction I present here work I have done as a software engineer with SYSTRAN. SYSTRAN is a leading company in machine
More informationTopics for Today. The Last (i.e. Final) Class. Weakly Supervised Approaches. Weakly supervised learning algorithms (for NP coreference resolution)
Topics for Today The Last (i.e. Final) Class Weakly supervised learning algorithms (for NP coreference resolution) Co-training Self-training A look at the semester and related courses Submit the teaching
More informationInformation Extraction Techniques in Terrorism Surveillance
Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism
More informationEurown: an EuroWordNet module for Python
Eurown: an EuroWordNet module for Python Neeme Kahusk Institute of Computer Science University of Tartu, Liivi 2, 50409 Tartu, Estonia neeme.kahusk@ut.ee Abstract The subject of this demo is a Python module
More informationModule 3: GATE and Social Media. Part 4. Named entities
Module 3: GATE and Social Media Part 4. Named entities The 1995-2018 This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs Licence Named Entity Recognition Texts frequently
More informationGrammar Knowledge Transfer for Building RMRSs over Dependency Parses in Bulgarian
Grammar Knowledge Transfer for Building RMRSs over Dependency Parses in Bulgarian Kiril Simov and Petya Osenova Linguistic Modelling Department, IICT, Bulgarian Academy of Sciences DELPH-IN, Sofia, 2012
More informationTopic 1: Introduction
Recommended Exercises and Readings Topic 1: Introduction From Haskell: The craft of functional programming (3 rd Ed.) Readings: Chapter 1 Chapter 2 1 2 What is a Programming Paradigm? Programming Paradigm:
More informationParmenides. Semi-automatic. Ontology. construction and maintenance. Ontology. Document convertor/basic processing. Linguistic. Background knowledge
Discover hidden information from your texts! Information overload is a well known issue in the knowledge industry. At the same time most of this information becomes available in natural language which
More informationWh-questions. Ling 567 May 9, 2017
Wh-questions Ling 567 May 9, 2017 Overview Target representation The problem Solution for English Solution for pseudo-english Lab 7 overview Negative auxiliaries interactive debugging Wh-questions: Target
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More informationThe Dictionary Parsing Project: Steps Toward a Lexicographer s Workstation
The Dictionary Parsing Project: Steps Toward a Lexicographer s Workstation Ken Litkowski ken@clres.com http://www.clres.com http://www.clres.com/dppdemo/index.html Dictionary Parsing Project Purpose: to
More informationAn Integrated Digital Tool for Accessing Language Resources
An Integrated Digital Tool for Accessing Language Resources Anil Kumar Singh, Bharat Ram Ambati Langauge Technologies Research Centre, International Institute of Information Technology Hyderabad, India
More informationCorrelation to Georgia Quality Core Curriculum
1. Strand: Oral Communication Topic: Listening/Speaking Standard: Adapts or changes oral language to fit the situation by following the rules of conversation with peers and adults. 2. Standard: Listens
More informationUniversity of Sheffield, NLP. Chunking Practical Exercise
Chunking Practical Exercise Chunking for NER Chunking, as we saw at the beginning, means finding parts of text This task is often called Named Entity Recognition (NER), in the context of finding person
More informationNLP in practice, an example: Semantic Role Labeling
NLP in practice, an example: Semantic Role Labeling Anders Björkelund Lund University, Dept. of Computer Science anders.bjorkelund@cs.lth.se October 15, 2010 Anders Björkelund NLP in practice, an example:
More informationSimilarity Overlap Metric and Greedy String Tiling at PAN 2012: Plagiarism Detection
Similarity Overlap Metric and Greedy String Tiling at PAN 2012: Plagiarism Detection Notebook for PAN at CLEF 2012 Arun kumar Jayapal The University of Sheffield, UK arunkumar.jeyapal@gmail.com Abstract
More informationUNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES
UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES Saturday 10 th December 2016 09:30 to 11:30 INSTRUCTIONS
More information