COTiG / LINLaP: a Language-Independent NLP Architecture used as Grammar Checker


1 COTiG / LINLaP: a Language-Independent NLP Architecture used as Grammar Checker
Francesc Benavent, GLiCom (UPF)
NLP Seminar, UPC, November ...th

2 Index
1. Introduction
2. Architecture
3. Data representation
4. Modules
5. Discussion

3 Introduction
1.1. Overview
1.2. Technology
1.3. Context and needs

4 Overview
- Development framework:
  - as a surface-oriented linguistic analyzer
  - for robust NLP-enhanced applications
  - in academic and industrial environments
- Language-independent:
  - linguistic information kept as external resources
  - currently dealing with Catalan texts
  - Spanish resources are underway

5 Technology
- Programmed in C# and C++
  - OOP (Object-Oriented Programming)
  - API (Application Programming Interface)
- Using the .NET/Mono platform
  - Linux
  - Windows
  - Mac OS X

6 Context and needs
- Engine requirements
  - External data driven (detection rules)
  - Broad span of external GUIs (Web, Word, Firefox, ...)
  - Reusable to cover future (unknown) applications
- Project limitations
  - Short development period (10-12 months)
  - Parallel development (up to 5 programmers)

7 Architecture
2.1. LINLaP
2.2. COTiG
2.3. Encapsulation
2.4. Libraries

8 LINLaP
- Modular:
  - Flexible
  - Customizable
- Linear:
  - Segmentation
  - Dict. lookup
  - PoS tagging
  - Other...
- Progressive enrichment
[Figure: processing pipeline. TextStrings -> BREAKER -> TokenList -> LABELER -> TLPoS[n] -> CHOOSER -> TLPoS -> other -> TLinfo; DICT is the dictionary resource.]
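The linear, progressively enriching design can be pictured as a chain of modules that all work on the same token list, each one adding information. The following is a minimal Python sketch of that idea, not the project's C#/C++ code: the function names, the dict-based token record and the toy logic inside each module are assumptions made only for illustration.

```python
from typing import Any, Callable, Dict, List

Token = Dict[str, Any]                      # hypothetical token record
Module = Callable[[List[Token]], List[Token]]

def breaker(text: str) -> List[Token]:
    # Segmentation stand-in: plain whitespace splitting.
    return [{"source": s} for s in text.split()]

def labeler(tokens: List[Token]) -> List[Token]:
    # Dictionary-lookup stand-in: attach a list of candidate labels.
    for t in tokens:
        t["possible_labels"] = ["Num"] if t["source"].isdigit() else ["Word"]
    return tokens

def chooser(tokens: List[Token]) -> List[Token]:
    # Disambiguation stand-in: keep the first candidate.
    for t in tokens:
        t["best_label"] = t["possible_labels"][0]
    return tokens

def run_pipeline(text: str, modules: List[Module]) -> List[Token]:
    tokens = breaker(text)
    for module in modules:                  # linear: one module after another
        tokens = module(tokens)             # each step enriches the same list
    return tokens

result = run_pipeline("te 2 gats", [labeler, chooser])
```

Because every module has the same signature, adding or removing a step is a change to the list passed to `run_pipeline`, which is one way to read "modular" and "customizable" on this slide.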

9 COTiG
[Figure: the processing pipeline (TextStrings, BREAKER, TokenList, LABELER, CHOOSER, other, TLPoS, TLinfo, DICT) extended with a Detection layer of three modules.]
- TYPO: orthotypographic errors
- SPELL: orthographic errors
- GRAM: grammatical errors

10 Encapsulation
- Inner APIs:
  - Independent development
  - Internal testing
  - External evaluation
  - Easy integration
  - Parallel development (dummy modules)
- Extreme modularity:
  - Inclusion, combination and extension of modules
  - Freedom of implementation and techniques (e.g. handcrafted rules vs. ML-induced models)

11 Libraries
[Figure: library diagram. Labels: Graphic User Interfaces; Cotig Main Cat dll; ICotigMain; ICotig>; CotigToken.dll; CotigTypo.dll; CotigLabel.dll; CotigSpell.dll; CotigChoose.dll; CotigGram.dll; CotigDict.dll; CotigShared.dll]

12 Data representation
3.1. Blocks
3.2. Tokens
3.3. Labels
3.4. Errors

13 Blocks
Property             Type      Description
Text                 <string>  Source text
BlockType            <enum>    8: Any / Controls / Blanks / NoText / Paragraph / Title / List / Oth.
IsProcessable        <bool>    Whether this block must be linguistically processed
SpanFromChar         <int>     Starting position in the original document
SpanToChar           <int>     Ending position in the original document
LengthInCharacters   <int>     Length of the block
- Structural unit (block of text)
- A document is a list of Blocks

14 Tokens
Property         Type       Description
SpanFromChar     <int>      Starting position in the original block
SpanToChar       <int>      Ending position in the original block
Source           <string>   Source text
Normalized       <string>   Normalized text
TokenType        <enum>     26*: Any / Tag / Break / Punct / Symbl / Num / Word / Complex /...
TokenShape       <enum>     20*: None / Lower / Upper / Title / Plain / FinalDot / StartApos /...
PossibleLabels   <label[]>  Once labelled, the list of possible readings
BestLabel        <label>    Once disambiguated, the most probable label
IsComplex        <bool>     Whether this token was created by merging simple tokens
- Lexical unit (string of chars)
- A Block is a list of Tokens

15 Labels
Property         Type      Description
Form             <string>  Form
Lemma            <string>  Lemma
Category         <enum>    44*: Any / Determiner / Adjective / Noun / Pronoun / Verb / Adverb / Conjunction / Interjection / Preposition / Data / Punctuation /...
Features         <enum>    24*: Gender / Number / Person / Case / Pronoun / Tense / Proper /...
Dialect          <enum>    8*: Variant / Style
CanonicalShape   <enum>    20: None / Lower / Upper / Title / Plain / FinalDot / StartDash /...
Frequency        <float>   Relative frequency of this lemma in a standard corpus
- One morphosyntactic reading
- Used in: tagging and dictionary

16 Errors
Property        Type         Description
SpanFromToken   <int>        Starting position in the current TokenList
SpanToToken     <int>        Ending position in the current TokenList
Code            <enum>       Classification of the error: Typo, Spell, Gram; Missing, Wrong, Added; Sign, Form, Phrase
Message         <string>     Description text displayed to the end user
Rule            <string>     Description of the fired rule (for debugging purposes)
Detector        <enum>       Module that found the error and added it to the ErrorList
Corrections     <TokList[]>  List of suggested multi-token replacements
- One specific detected error (may include suggestions)
- Created by detection modules
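The four structures of this section (Blocks, Tokens, Labels, Errors) nest naturally: a document is a list of blocks, a block is a list of tokens, a token carries labels, and errors point at token spans. As a rough illustration only, here is how they might look as Python dataclasses. The field names follow the slides in snake_case; the enums are reduced to plain strings, only a subset of properties is shown, and treating `span_to_char` as exclusive is an assumption, since the slides do not say.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Label:                       # one morphosyntactic reading
    form: str
    lemma: str
    category: str                  # an enum in the slides; a string here
    frequency: float = 0.0

@dataclass
class Token:                       # lexical unit inside a block
    span_from_char: int
    span_to_char: int              # assumed exclusive (not stated in the slides)
    source: str
    normalized: str
    possible_labels: List[Label] = field(default_factory=list)
    best_label: Optional[Label] = None   # filled in after disambiguation
    is_complex: bool = False

@dataclass
class Block:                       # structural unit of a document
    text: str
    block_type: str
    is_processable: bool
    span_from_char: int
    span_to_char: int
    tokens: List[Token] = field(default_factory=list)

    @property
    def length_in_characters(self) -> int:
        return len(self.text)

@dataclass
class DetectedError:               # created by a detection module
    span_from_token: int
    span_to_token: int
    code: str
    message: str
    detector: str
    corrections: List[List[Token]] = field(default_factory=list)

block = Block("Hola", "Paragraph", True, 0, 4)
block.tokens.append(Token(0, 4, "Hola", "hola"))
```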

17 Modules
4.1. Dictionary
4.2. Breaker (P)
4.3. Labeler (P)
4.4. Chooser (P)
4.5. Typo (D)
4.6. Speller (D)
4.7. Grammar (D)
P = Processing, D = Detection

18 Modules
[Figure: COTiG module diagram. Processing: BREAKER, LABELER, CHOOSER, other (TextStrings, TokenList, TLPoS, TLinfo); Detection: TYPO, SPELL, GRAM; DICT.]

19 Dictionary
- Resource module:
  - Shared by any module that needs lexical information
  - Contains lexical entries with relevant information (morphosyntactic, typographic, stylistic, dialect-related)
  - Allows direct and inverse searches
[Figure: direct search: Form -> DICT -> Labels[]; inverse search: Label -> DICT -> Forms[]]
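Direct and inverse search can be served by keeping two indexes over the same entries. The sketch below is an assumption about how that could look, written in Python for brevity: the `Dictionary` class, its method names and the simplified `(lemma, category)` label are all hypothetical, and the real label carries more fields than shown here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Label = Tuple[str, str]   # simplified, hypothetical label: (lemma, category)

class Dictionary:
    def __init__(self) -> None:
        self._by_form: Dict[str, List[Label]] = defaultdict(list)
        self._by_label: Dict[Label, List[str]] = defaultdict(list)

    def add(self, form: str, label: Label) -> None:
        # Every entry is registered in both indexes.
        self._by_form[form].append(label)
        self._by_label[label].append(form)

    def labels_for(self, form: str) -> List[Label]:
        # Direct search: Form -> Labels[]
        return list(self._by_form.get(form, []))

    def forms_for(self, label: Label) -> List[str]:
        # Inverse search: Label -> Forms[]
        return list(self._by_label.get(label, []))

dictionary = Dictionary()
dictionary.add("casa", ("casa", "Noun"))
dictionary.add("casa", ("casar", "Verb"))
dictionary.add("cases", ("casa", "Noun"))
```

Here `dictionary.labels_for("casa")` returns both readings of the ambiguous form, while `dictionary.forms_for(("casa", "Noun"))` returns every form stored for that reading.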

20 Breaker (P)
Example: El mètode API_Get_Money_(_date_) retorna 14.95_$_. L'_exemple té dues oracions_. Aquesta és la segona_.
- Segmentation levels:
  - Blocks
    - HTML (tags)
    - Raw text (heuristics)
  - Tokens
    - merging µtokens
    - by following contextual rules (grammar)
  - Sentences (ambiguity -> conservative strategy)
    - inserting limit markers
    - by following contextual rules (patterns)
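The token level works in two steps: cut the text into minimal pieces, then merge pieces back together when a contextual rule says they belong to one token. A minimal Python sketch of that idea follows. The regular expression, the function names and the single "decimal number" rule are assumptions chosen for illustration; the project's actual rule set is not described in the slides.

```python
import re
from typing import List

# A micro-token is a run of word characters or one non-space symbol.
MICRO = re.compile(r"\w+|[^\w\s]")

def micro_tokens(text: str) -> List[str]:
    return MICRO.findall(text)

def merge_decimals(tokens: List[str]) -> List[str]:
    # Contextual rule (hypothetical): digits "." digits -> one token.
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i].isdigit()
                and tokens[i + 1] == "."
                and tokens[i + 2].isdigit()):
            out.append("".join(tokens[i:i + 3]))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = merge_decimals(micro_tokens("retorna 14.95 $."))
```

With this rule the dot inside `14.95` stays part of the number, while the final dot remains a separate token that a sentence-level rule could later treat as a limit marker.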

21 Labeler (P)
- Input: a list of tokens (PossibleLabels[Ø])
- Output: an enriched list of tokens (PossibleLabels[1..n])
- Depending on token type:
  - Lexical word: dictionary lookup
  - Non-lexical: pre-determined mapping
- Implementation
  - As a word form list
  - Indexed by two hash tables
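The labelling step is essentially a dispatch on token type: lexical words go to the word form list, everything else gets a fixed label. A small Python sketch under stated assumptions: the two tables, their contents and the choice of `"Data"` for numbers are invented for the example; only the category and token-type names are taken from the slides.

```python
from typing import Dict, List

# Hypothetical word form list: form -> possible categories.
WORD_FORMS: Dict[str, List[str]] = {
    "la": ["Determiner", "Pronoun"],
    "casa": ["Noun", "Verb"],
}
# Hypothetical pre-determined mapping for non-lexical token types.
NON_LEXICAL: Dict[str, str] = {"Num": "Data", "Punct": "Punctuation"}

def label_token(source: str, token_type: str) -> List[str]:
    if token_type == "Word":
        # Dictionary lookup; an unknown word gets no labels at all.
        return list(WORD_FORMS.get(source.lower(), []))
    # Non-lexical tokens get exactly one label from the fixed mapping.
    return [NON_LEXICAL.get(token_type, "Any")]

labelled = [label_token(s, t) for s, t in
            [("La", "Word"), ("casa", "Word"), ("3", "Num"), (".", "Punct")]]
```

Each lookup is a constant-time hash access, so labelling a text takes time linear in the number of tokens.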

22 Chooser (P)
- Input: a list of tokens (BestLabel = Ø)
- Output: an enriched list of tokens (BestLabel = label_m)
- Disambiguation task:
  - Choosing the most probable label from the candidates
- Implementation
  - As a standard stochastic tagger (HMM / trigram)
  - Unknown words: add-one smoothing
  - Unknown trigrams: backoff (bigram and unigram)
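The two fallbacks named here can be shown in a few lines. The Python sketch below assumes the simplest possible forms: add-one (Laplace) smoothing for the emission estimate, and a plain backoff from trigram to bigram to unigram without discounting, so the transition values are scores rather than a normalized distribution. The class and function names are hypothetical, and the slides do not say which backoff variant the project uses.

```python
from collections import Counter
from typing import Sequence

class TransitionModel:
    def __init__(self, tag_sequences: Sequence[Sequence[str]]) -> None:
        self.tri, self.tri_ctx = Counter(), Counter()
        self.bi, self.bi_ctx = Counter(), Counter()
        self.uni = Counter()
        for tags in tag_sequences:
            padded = ["<s>", "<s>", *tags]          # two start symbols
            for a, b, c in zip(padded, padded[1:], padded[2:]):
                self.tri[(a, b, c)] += 1
                self.tri_ctx[(a, b)] += 1
                self.bi[(b, c)] += 1
                self.bi_ctx[b] += 1
                self.uni[c] += 1

    def score(self, a: str, b: str, c: str) -> float:
        # Score of tag c after tags a, b, backing off when counts are zero.
        if self.tri[(a, b, c)]:
            return self.tri[(a, b, c)] / self.tri_ctx[(a, b)]
        if self.bi[(b, c)]:
            return self.bi[(b, c)] / self.bi_ctx[b]
        return self.uni[c] / sum(self.uni.values())

def emission(word_tag: Counter, tag_count: Counter, vocab_size: int,
             word: str, tag: str) -> float:
    # Add-one smoothing: unseen (word, tag) pairs still get a small mass.
    return (word_tag[(word, tag)] + 1) / (tag_count[tag] + vocab_size)

model = TransitionModel([["Det", "Noun", "Verb"], ["Det", "Noun"]])
```

A Viterbi search over the candidate labels of each token would then combine these two quantities; that search is left out to keep the sketch short.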

23 Modules
[Figure: COTiG module diagram. Processing: BREAKER, LABELER, CHOOSER, other (TextStrings, TokenList, TLPoS, TLinfo); Detection: TYPO, SPELL, GRAM; DICT.]

24 Typo (D)
- Input: a list of unlabelled tokens
- Output: a list of orthotypographic errors
- Detection task:
  - Find contextual orthotypographic errors (case, spaces, punctuation...)
- Implementation
  - Detection: hard-coded patterns encode rules for common mistakes
  - Suggestions: hard-coded patterns encode the re-generation of the token list
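A hard-coded orthotypographic pattern needs only the raw tokens and their context. As an illustration, here is a single hypothetical rule in Python: a sentence-initial word must start with a capital letter. The function name, the message text and the tuple layout of a finding are assumptions, not the project's API.

```python
from typing import List, Tuple

# (token index, message, suggested replacement tokens)
Finding = Tuple[int, str, List[str]]

def detect_case_errors(tokens: List[str]) -> List[Finding]:
    findings: List[Finding] = []
    sentence_start = True
    for i, tok in enumerate(tokens):
        if sentence_start and tok[:1].islower():
            # Suggestion: regenerate the token with an initial capital.
            findings.append((i, "Sentence should start with a capital letter",
                             [tok[0].upper() + tok[1:]]))
        # The next token starts a sentence only after a closing mark.
        sentence_start = tok in {".", "!", "?"}
    return findings

found = detect_case_errors(
    ["aquesta", "és", "la", "segona", ".", "i", "prou", "."])
```

Detection and suggestion come from the same pattern: the rule that spots the lowercase initial also knows how to rebuild the corrected token.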

25 Spell (D)
- Input: a list of labelled tokens (PossibleLabels = Ø or label[])
- Output: a list of orthographic errors
- Detection task:
  - Find non-contextual orthographic errors
- Implementation
  - Detection: words not found in the dictionary
  - Suggestion: inspired by NetSpell (distance-based)
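Distance-based suggestion can be illustrated with plain Levenshtein edit distance. Note the substitution: this Python sketch does not reproduce NetSpell's own algorithm, and the tiny lexicon, the function names and the distance threshold are assumptions made for the example.

```python
from typing import Iterable, List

def edit_distance(a: str, b: str) -> int:
    # Classic dynamic programming, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def suggest(word: str, lexicon: Iterable[str], max_distance: int = 1) -> List[str]:
    # Closest entries first, ties broken alphabetically.
    scored = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for d, w in scored if d <= max_distance]

LEXICON = {"casa", "cosa", "gat", "gos"}        # hypothetical dictionary

def is_misspelled(word: str) -> bool:
    # Detection is non-contextual: the word is simply not in the dictionary.
    return word not in LEXICON
```

For instance, `suggest("cassa", LEXICON)` proposes `casa`, the only entry within one edit.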

26 Grammar (D)
- Input: a list of disambiguated tokens (BestLabel = label_m)
- Output: a list of grammatical errors
- Detection task:
  - Find contextual grammatical errors (agreement,...)
- Implementation
  - Contextual rules declared manually, following a formalism defined for the project
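The project's rule formalism is not shown in the slides, so the following Python sketch only illustrates the general idea of a declarative contextual rule: a pair of adjacent categories plus the features on which they must agree. The rule table, the `Tagged` record and the feature codes are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tagged:               # a disambiguated token, reduced to what the rule needs
    source: str
    category: str
    gender: str
    number: str

# Hypothetical rule: adjacent (left, right) categories and the features to compare.
AGREEMENT_RULES = [(("Determiner", "Noun"), ("gender", "number"))]

def detect_agreement_errors(tokens: List[Tagged]) -> List[Tuple[int, int, str]]:
    errors: List[Tuple[int, int, str]] = []
    for i in range(len(tokens) - 1):
        left, right = tokens[i], tokens[i + 1]
        for categories, features in AGREEMENT_RULES:
            if (left.category, right.category) != categories:
                continue
            bad = [f for f in features if getattr(left, f) != getattr(right, f)]
            if bad:
                # Error span is given in token positions, as in the Errors table.
                errors.append((i, i + 1, "No agreement in " + ", ".join(bad)))
    return errors

wrong = [Tagged("els", "Determiner", "M", "P"), Tagged("casa", "Noun", "F", "S")]
right = [Tagged("la", "Determiner", "F", "S"), Tagged("casa", "Noun", "F", "S")]
```

Keeping the rules as data rather than code is what allows them to live in external files, in line with the "external data driven" requirement stated earlier.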

27 Discussion
5.1. On the architecture
5.2. On the modules
5.3. Future improvements
5.4. Conclusions

28 On the architecture
- Advantages:
  - Strict division of levels -> easier task sharing
  - Strict separation of modules -> easier development
  - Encapsulation -> flexibility and smooth integration
  - Object based -> robustness and efficiency
  - Blackboard inspired -> all modules see all the information
- Limitations:
  - One-to-one mapping between Tokens <-> Labels (issues with contractions and chunk labelling)

29 On the modules
- Breaker: level segmentation
  - (+) expressivity of token patterns
  - (-) hard-coded patterns (external files are underway)
- Labeler (dictionary based)
  - (+) high-speed tagging (linear time)
  - (-) high memory usage (700K words = 100Mb)
- Chooser
  - (+) precision and speed
  - (-) low granularity (currently only 8 PoS)

30 Future improvements
- On the modules:
  - Breaker: tokenization rules from external files
  - Chooser: increasing the granularity of PoS
- On the architecture:
  - Replacing the TokenList with a TokenChart (in order to hold multi-token entities)
  - Adding support for generic n-level annotations

31 Conclusions I
LINLaP is an expandable, multiplatform, language-independent architecture and development framework for robust NLP applications. Its architecture makes it easy to combine different modules and techniques, and to extend it to higher levels of linguistic description.

32 Conclusions II
The current version handles Catalan texts (Spanish is underway) and is limited to morphosyntactic tagging. It has been adapted to context-sensitive error correction, and it will be used as part of an information extraction and an automatic summarization system.

33 Questions | Suggestions
Francesc Benavent, GLiCom (UPF)
NLP Seminar, UPC, November ...th
