Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017

Size: px

Start display at page:

Download "Tokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017"

Charlotte Morton
6 years ago
Views:

1 Tokenization and Sentence Segmentation Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017

2 Outline 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 2/49

3 Tokenization Why do we need to segment a text in NLP? Because a piece of text is more than a long string of characters, it contains basic linguistic units that we are interested in: Words (that we must deal with lexicons) Sentences (that separate unit of sense) Paragraphs... For a human: Bob eats apples. For a computer: Grundläggande textanalys 3/49

4 Tokenization Why do we need to segment a text in NLP? Because a piece of text is more than a long string of characters, it contains basic linguistic units that we are interested in: Words (that we must deal with lexicons) Sentences (that separate unit of sense) Paragraphs... For a human: Bob eats apples. For a computer: Grundläggande textanalys 3/49

5 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 4/49

6 Tokenization Definition Segmenting text into linguistic units, such as words, punctuations and numbers. For example: Bob eats apples. Contains 4 tokens: Bob eats apples. Grundläggande textanalys 5/49

7 Warm Up QUIZ Question Can you tell me how many tokens contains this extract (count everything that is bold): The wolf said: "Little pig, little pig, let me come in." Grundläggande textanalys 6/49

8 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 7/49

9 Tokenization Is tokenization trivial? 11:45 p.m. Ugh. First day of New Year has been day of horror. Can t quite believe I am once again starting the year in a single bed in my mom s house. [...]When I got to the Alconburys and rang their entire-tune-of-town-hall-clock-style doorbell I was still in a strange world of my own nauseous, vile-headed, acidic. What if... Should we rely only on spaces? Grundläggande textanalys 8/49

10 Tokenization Is tokenization trivial? 11:45 p.m. Ugh. First day of New Year has been day of horror. Can t quite believe I am once again starting the year in a single bed in my mom s house. [...]When I got to the Alconburys and rang their entire-tune-of-town-hall-clock-style doorbell I was still in a strange world of my own nauseous, vile-headed, acidic. What if... Should we rely only on spaces? Grundläggande textanalys 8/49

11 Hyphens True hyphens: Lexical hyphen: "anti-discriminatory" "twenty-two" Sententially determined hyphenation: "their entire-tune-of-town-hall-clock-style doorbell" "New York-based" End-of-line hyphens as line breakers: Word segmentation and part-of-speech (POS) tagging are core steps for higher-level natural language processing (NLP) tasks. Grundläggande textanalys 9/49

12 It is indeed tricky... There is not a straight forward way to deal with hyphens Context dependant: twenty-two vs New York-based Task dependent: New York-based cannot be tokenized the same way if we do Part-of-Speech, Information Retrieval, translation... Language dependent problem Grundläggande textanalys 10/49

13 Tokenization Some other special tokens: C++ addresses Phone number ( ) City names (San Francisco, New York) date 23/05/14, 23-May-2014 Compounds (statsminister)... Can be processed with Rule-based system, dictionary, etc Machine learning approaches Grundläggande textanalys 11/49

14 Apostrophe Mr. O Neill thinks that the boys stories about Chile s capital aren t amusing. The rules for apostrophes are: Language specific Context sensitive: "O Neill" vs "Chile"+" s" Grundläggande textanalys 12/49

15 Apostrophe How is it dealt in softwares? Here is an example of how some apostrophes are handled in a well known tagger: Figure: Regex for clitics in TreeTagger s tokenizer Grundläggande textanalys 13/49

16 Chinese Word Segmentation 结合成分子时结合 (combine) 成 (finish) 分子 (molecular) 时 (when) Meaning: When combined into moleculars 结合 combine 合成 compose 成分 component 分子 molecular 子时 midnight 2 5 = 32 结 joint, end 合 add, close 成 finish, success 分 divide, minute 子 son, child, seed 时 time, when possible ways

17 Chinese Word Segmentation 结合成分子时 Tags: B E S B E S

18 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 14/49

19 Is this text tokenized? The wolf said : " Little pig, let me come in. " Grundläggande textanalys 15/49

20 Is this text tokenized? The wolf said : " Little pig, let me come in. " Note For machine translation purpose Grundläggande textanalys 15/49

21 Is this text tokenized? The wolf said: "Little pig, let me come in." Grundläggande textanalys 16/49

22 Is this text tokenized? <tok>the</tok> <tok>wolf</tok> <tok>said</tok> <tok>:</tok> <tok>"</tok> <tok>little</tok> <tok>pig</tok> <tok>,</tok> <tok>little</tok> <tok>pig</tok> <tok>,</tok> <tok>let</tok> <tok>me</tok> <tok>come</tok> <tok>in</tok> <tok>.</tok> <tok>"</tok> Grundläggande textanalys 17/49

23 Is this text tokenized? <tok>the</tok> <tok>wolf</tok> <tok>said</tok> <tok>:</tok> <tok>"</tok> <tok>little</tok> <tok>pig</tok> <tok>,</tok> <tok>little</tok> <tok>pig</tok> <tok>,</tok> <tok>let</tok> <tok>me</tok> <tok>come</tok> <tok>in</tok> <tok>.</tok> <tok>"</tok> Note tokenization separators for XML document Grundläggande textanalys 17/49

24 Is this text tokenized? The wolf said : " Little pig, let me come in. " Grundläggande textanalys 18/49

25 Is this text tokenized? The wolf said : " Little pig, let me come in. " Note tokenization separators for tagging purpose Grundläggande textanalys 18/49

26 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 19/49

27 Evaluation Choose between two tokenizers based on some observations: 1 Observation 1: Tokenizer 1 systematically tokenizes wrong the punctuations (for instance it gives the token cheese! instead of the tokens cheese and! ) whereas Tokenizer 2 only gets wrong on tokenizing some hyphens which is a less recurrent mistake. 2 Observation 2: Tokenizer 1 divides the text into 1150 tokens whereas Tokenizer 2 divides into 1100 and manually checked the text has 1000 tokens is closer to Grundläggande textanalys 20/49

28 Evaluation How can we explicitly evaluate the accuracy of a tokenizer? Take a simple example, non-ambiguous: I eat an apple. Manually create the gold standard tokenization of this example. I eat an apple. Grundläggande textanalys 21/49

29 Evaluation How can we explicitly evaluate the accuracy of a tokenizer? Take a simple example, non-ambiguous: I eat an apple. Manually create the gold standard tokenization of this example. I eat an apple. Grundläggande textanalys 21/49

30 Evaluation Now we have some output of several tokenizers. a I eat an apple. b I eat a n apple. c I eat a n a p p l e. d I eat an apple. All of them are wrong, but some are worse that others. Grundläggande textanalys 22/49

31 Evaluation Standard Evaluation Matrix for most NLP tasks Recall: number of tokens that are right / number of tokens to be produced according to the gold standard. Precision: number of tokens that are right / number of tokens that were automatically generated. F1-score: 2 Recall Precision / (Recall + Precision) Grundläggande textanalys 23/49

32 Evaluation Gold standard: I eat an apple. 1 I eat a n a p p l e. 2 I eat an apple. 1 Recall=2/ 5 =40% Precision=2/(2+7)=2/9=22% 2 Recall=1/ 5 =20% Precision=1/(1+1)=1/2=50% Grundläggande textanalys 24/49

33 Evaluation Gold standard: I eat an apple. a I eat an apple. b I eat a n apple. c I eat a n a p p l e. d I eat an apple.. Recall:? Precision:? F1-score:? Grundläggande textanalys 25/49

34 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 26/49

35 Summary Tokenization is: a basic task that almost every NLP task requires for Western European languages relatively well handled (and specific problems through other languages: germanic, oriental etc.) Tokenization is a universal task (as many in NLP) The accuracy has a significant impact on the higher-level tasks. Grundläggande textanalys 27/49

36 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 28/49

37 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 29/49

38 Sentence Segmentation Introduction Sentence Segmentation Sentence segmentation find sentence boundaries between words in different sentences. Since most written languages have punctuation marks which occur at sentence boundaries, sentence segmentation is frequently referred to as sentence boundary detection, sentence boundary disambiguation, or sentence boundary recognition. All these terms refer to the same task: determining how a text should be divided into sentences for further processing. Grundläggande textanalys 30/49

39 Why sentence segmentation? Preprocessing step for higher-level task Useful for: Syntactical Parsing, Information Extraction, Machine Translation, Text Alignment Not (very) trivial The basic algorithm [.?!][ ()"]+[A-Z] (which means select every punctuation followed by a space and an upper-case. ) provokes 6.5% errors on the Brown corpus and WSJ Grundläggande textanalys 31/49

40 Sentence Segmentation Dot. is ambiguous Decimal ex Abreviations (your result will be influenced by the number of them) Named entity: Web sites, IP address ( :3128, etc.) Grundläggande textanalys 32/49

41 Sentence Segmentation Different sentence segmentation approaches Machine learning based: Unsupervised they can be trained on any language and genre. Supervised they require annotation of a training set. We will cover some rule-based methods Grundläggande textanalys 33/49

42 Sentence Segmentation Rule based systems Take the local context of the punctuation into account: regular expressions [.?!][ ()"]+[A-Z] or finite states transducers Grundläggande textanalys 34/49

43 Rule based systems Rule based systems Moreover: A list of frequent starting words (The, He, However...) A list of abbreviations (Mr., Av., a.m., A.S.A.P) Grundläggande textanalys 35/49

44 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 36/49

45 Is this sentence segmented? Welcome! Can I help you? I m fine. Grundläggande textanalys 37/49

46 Is this sentence segmented? Welcome! Can I help you? I m fine. Grundläggande textanalys 38/49

47 Is this sentence segmented? Welcome! Can I help you? I m fine. Grundläggande textanalys 39/49

48 Is this sentence segmented? Welcome! Can I help you? I m fine. Grundläggande textanalys 40/49

49 Is this sentence segmented? <sent><tok>welcome</tok><tok>! </tok></sent><sent><tok> Can </tok> <tok> I </tok><tok> help </tok><tok> you </tok><tok>?</tok></sent> Grundläggande textanalys 41/49

50 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 42/49

51 Summary Similarly to tokenization, there is no universal way for sentence segmentation Language specific Genre specific Grundläggande textanalys 43/49

52 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 44/49

53 Summary Two NLP preprocessing: - Tokenization - Sentence segmentation They vary across different genres, NLP tasks, as well as languages Some ambiguities are common and need to be handle such as hyphenation and dots Their accuracies have significant impacts on higher-level NLP tasks. (Parsing for instance).foster [2010] Grundläggande textanalys 45/49

54 Table of Contents 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation Introduction Exercise Summary 3 Summary 4 Practical Guidance Grundläggande textanalys 46/49

55 Practical Guidance How to solve a problem in the lab sessions 1 Make sure you understand what you need to do on it (e.g., Tokenise) 2 Make some simple example(s) (e.g., I eat an apple) 3 Solve the simple samples by hand. (e.g., I eat an apple.) 4 If you think everything is clear, make the program that do the task. 5 Test your program on your examples before applying to the real data 6 If the output is correct, apply to the real data for further testing. Grundläggande textanalys 47/49

56 Practical Guidance How to write a good lab report Give a global overview of the results ( %, numbers) Exemplify enough to make your point. Give a consistent information presentation to the reader (ex. Write all examples in italic, put data into tables for comparisons etc.) Target your answer to the question. Make sure the information you provide is meaningful and clear out of its context. Grundläggande textanalys 48/49

57 References Jennifer Foster. cba to check the spelling : Investigating Parser Performance on Discussion Forum Posts. In Proceedings of the NAACL-HLT, pages Association for Computational Linguistics, URL Ruslan Mitkov. The Oxford Handbook of Computational Linguistics, volume 0 of Oxford Handbooks in Linguistics. Oxford University Press, URL com/books?hl=en&lr=&id=oaclhre-vw4c&pgis=1. Grundläggande textanalys 49/49

Information Retrieval and Organisation

Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 02 The Term Vocabulary and Postings Lists Constructing Inverted Indexes The major steps in constructing