Corpus Linguistics. Seminar Resources for Computational Linguists, SS 2007. Magdalena Wolska & Michaela Regneri

1 Seminar Resources for Computational Linguists SS 2007 Magdalena Wolska & Michaela Regneri

2 Armchair Linguists vs. Corpus Linguists: Competence vs. Performance

3 Motivation (for Corpus Linguistics)

4 Outline
- Corpora
- Annotation
- Data Analysis
- The Web as Corpus

6 Corpus - definition
- in principle: any collection of text
- (desired or necessary) properties of corpora for linguistic processing:
  - representativeness
  - finite size (mostly)
  - machine-readability
  - standard reference

7 Corpus - properties
- language mode (speech vs. text)
- languages and alignment: mono-/bilingual, comparable/parallel
- balance: homo-/heterogeneous, balanced/unbalanced
- annotation: plain/annotated, annotation type and depth
- text types (newspapers, novels, phone calls...)
- text domains (business, finance, love stories...)
- date / time span (of texts used)
- size

9 Annotation - principles
- linguistic information in a corpus
- maxims of annotation (Leech 1993):
  - removable and extractable annotation
  - guidelines available to end user
  - awareness of fallibility (but potential usefulness)
  - scheme should be based on widely agreed principles which are theory-neutral

10 Annotation - Data Format
Often variants of XML:

stand-off:

    The dog barks.

    <sentence>
      <phrase type="NP">
        <word ind="1" pos="det"/>
        <word ind="2" pos="N"/>
      </phrase>
      <phrase type="VP">
        <word ind="3" pos="VI"/>
      </phrase>
    </sentence>

inline:

    <sentence>
      <phrase type="NP">
        <word pos="det">the</word>
        <word pos="N">dog</word>
      </phrase>
      <phrase type="VP">
        <word pos="VI">barks</word>
      </phrase>
    </sentence>
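
The relation between the two formats can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `inline_to_standoff` and the use of the standard-library `xml.etree.ElementTree` parser are assumptions made for the example. It moves the tokens out of the markup and leaves an index (`ind`) on each word element instead.

```python
import xml.etree.ElementTree as ET

# Inline annotation as on the slide (tokens sit inside the markup).
INLINE = """
<sentence>
  <phrase type="NP">
    <word pos="det">the</word>
    <word pos="N">dog</word>
  </phrase>
  <phrase type="VP">
    <word pos="VI">barks</word>
  </phrase>
</sentence>
"""

def inline_to_standoff(inline_xml):
    """Hypothetical helper: split inline annotation into tokens + stand-off tree."""
    root = ET.fromstring(inline_xml.strip())
    tokens = []
    for word in root.iter("word"):          # document order
        tokens.append(word.text.strip())
        word.text = None                    # take the token out of the markup
        word.set("ind", str(len(tokens)))   # 1-based pointer into the token list
    return tokens, root

tokens, standoff = inline_to_standoff(INLINE)
print(" ".join(tokens))
print(ET.tostring(standoff, encoding="unicode"))
```

Keeping the text and the annotation apart like this is what makes the annotation removable, in the sense of the maxims on slide 9.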

11 Annotation - examples: Treebanks (syntax)

12 Annotation - examples: semantic roles (SALSA)

13 Annotation - examples: discourse structure

14 Annotation - Tools
- graphical UIs, similar to the output, for drawing annotations
- example: RSTTool

16 Data Analysis
- word counts (word frequency, tokens per type)
- concordance: the same word in different contexts

     La Streisand sounded just like the student activist she played in the film T
     s pilot's wings, he was judged top student. After his weapons training, he w
     with Der Bettelstudent (The Beggar Student) and Gasparone in the fairly rece
     S.LOWRY: THE MAN AND HIS ART: As a student and long-time resident of Salford
        Antony Fleat, a second-year law student at Oxford Brookes University; and t
            oung life. This second-year student at Robert Gordon's university in Ab
       erdeen, having matriculated as a student at Robert Gordon's university. In
    M, 78, from Harrow, an anthropology student at the University of the Third Ag
       he had enough of London as a law student at University College and the Colle

- n-grams: count the frequencies of word combinations of n words

    3868 vergehen Jahr    1184 kommen Jahr
    2385 neu Land         1181 jung Mann
    2378 letzt Jahr       1107 groß Teil
    2296 nah Jahr          997 lang Zeit
    1398 erst Mal          986 nah Woche
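
The three analyses on this slide can be tried out with a small Python sketch. It is not from the original slides: the helper names (`tokenize`, `tokens_per_type`, `ngram_counts`, `concordance`), the naive whitespace tokenizer and the toy text are all assumptions for illustration. A real study would use a proper tokenizer and lemmatizer and would not count n-grams across sentence boundaries.

```python
from collections import Counter

def tokenize(text):
    """Naive tokenizer (assumption): lowercase, split on whitespace, strip punctuation."""
    stripped = (tok.strip(".,;:!?()\"'") for tok in text.lower().split())
    return [tok for tok in stripped if tok]

def tokens_per_type(tokens):
    """Average number of tokens per distinct word form (type)."""
    return len(tokens) / len(set(tokens))

def ngram_counts(tokens, n):
    """Frequencies of all sequences of n adjacent tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def concordance(tokens, keyword, width=3):
    """Keyword-in-context lines: `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines

# Toy text, invented for the example.
tokens = tokenize("The student met a law student. The law student left.")

print(Counter(tokens).most_common(3))          # word frequencies
print(tokens_per_type(tokens))                 # tokens per type
print(ngram_counts(tokens, 2).most_common(3))  # bigram frequencies
print("\n".join(concordance(tokens, "student")))
```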

17 Data Analysis - Information Access
pattern matching with query languages like CQP:

Query:

    [lemma="dog"] [pos!="\$.*"]* [pos="nn"] within s;

Examples:

    dog for her daughter
    dogs on the street
    dogs and their leashes
    dog with a cruel owner
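
To make the query easier to read, here is a hand-written Python approximation of what it asks for: a token with lemma "dog", then any number of tokens whose tag does not start with "$" (presumably the punctuation tags), then a token tagged "nn", all inside one sentence. This is a sketch, not CQP itself. The function name `match_dog_noun`, the `(word, lemma, pos)` triples, the example tags, and the choice to return the shortest match per starting point are assumptions.

```python
import re

def match_dog_noun(sentence):
    """sentence: list of (word, lemma, pos) triples for ONE sentence ("within s").

    Returns the matched word sequences as strings.
    """
    matches = []
    for start, (_, lemma, _) in enumerate(sentence):
        if lemma != "dog":                       # [lemma="dog"]
            continue
        for end in range(start + 1, len(sentence)):
            _, _, pos = sentence[end]
            if pos == "nn":                      # [pos="nn"] closes the match
                words = (w for w, _, _ in sentence[start:end + 1])
                matches.append(" ".join(words))
                break
            if re.fullmatch(r"\$.*", pos):       # violates [pos!="\$.*"]*
                break
    return matches

# Hand-tagged toy sentences (tags invented for the example).
s1 = [("The", "the", "det"), ("dog", "dog", "nn"), ("with", "with", "in"),
      ("a", "a", "det"), ("cruel", "cruel", "adj"), ("owner", "owner", "nn"),
      (".", ".", "$.")]
s2 = [("Dogs", "dog", "nn"), (",", ",", "$,"), ("cats", "cat", "nn")]

print(match_dog_noun(s1))
print(match_dog_noun(s2))
```

The second sentence yields no match because a token tagged "$," stands between "Dogs" and the next noun.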

19 The web as corpus
- the web is a collection of text, thus it is a corpus
- the largest available corpus: more than 7.2*10^11 words (10 times bigger than the English Gigaword Corpus [Liu and Curran 2006])
- nearly all kinds of text and lots of languages present
- not preprocessed, lots of ungrammatical (and linguistically useless) text
- how to access it?

20 The web as corpus
Document counts have been shown to correlate directly with real frequencies (Keller 2003), so search engines can help - but...
- lots of repetitions of the same text (not representative)
- very limited query precision (no upper/lower case, no punctuation...)
- only estimated counts, often hard to reproduce exactly
- how to access Google? :) (Google API, scripts)
- Alexa: buy (parts of) the web and process it on their machines

21 The web as corpus - examples
Extracting and filtering web documents to create linguistically annotated corpora (Baroni and Kilgarriff 2006):
- gather documents for different topics (balance!)
- exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
- exclude documents which seem irrelevant for a corpus (too short or too long, word lists,...)
- do this for several languages and make the corpora available
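
The "exclude irrelevant documents" step might look roughly like the following Python sketch. It is not the procedure from the cited paper: the function name `keep_document`, the thresholds, the tiny function-word list, and the idea of spotting word lists by a low share of function words are all assumptions chosen to keep the example short.

```python
# Small illustrative function-word list (assumption, English only).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}

def keep_document(text, min_tokens=5, max_tokens=1000, min_function_ratio=0.2):
    """Hypothetical filter: reject documents that are too short, too long,
    or that look like bare word lists (hardly any function words)."""
    tokens = text.lower().split()
    if not tokens or not (min_tokens <= len(tokens) <= max_tokens):
        return False
    function_ratio = sum(tok in FUNCTION_WORDS for tok in tokens) / len(tokens)
    return function_ratio >= min_function_ratio

docs = [
    "The dog barks at the man in the street",          # running text
    "apple banana cherry date elderberry fig grape",   # word list
    "Too short",                                        # too short
]
kept = [doc for doc in docs if keep_document(doc)]
print(kept)
```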

22 The web as corpus - examples
Directly using web counts (instead of corpus counts), e.g. VerbOcean (Chklovski & Pantel 2004):
- gather verb pairs which are semantically related but whose relation is unknown (from DIRT, Lin and Pantel 2001); example pair: love -- marry
- pick a semantic relation (e.g. "happens-before") and design typical patterns for this relation (e.g. "to X and then Y")
- instantiate the patterns ("to love and then marry") and count Google hits (here: 6)
- estimate whether or not the number of hits indicates a significant correlation, then assign the relation (or not)
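
The pattern-instantiation steps above can be sketched as follows. This is a simplified illustration, not VerbOcean's actual method: `PATTERNS`, `instantiate`, `assign_relation` and `FAKE_HITS` are hypothetical names, the hit counts come from a hard-coded dictionary instead of a search engine, and a plain threshold (`min_hits`) stands in for the significance estimate mentioned on the slide.

```python
# One pattern per relation, taken from the slide; more could be added.
PATTERNS = {"happens-before": ["to {x} and then {y}"]}

# Stand-in for search-engine hit counts (assumption). The 6 is the count
# quoted on the slide; unseen queries count as 0.
FAKE_HITS = {"to love and then marry": 6}

def hit_count(query):
    """Hypothetical lookup replacing a real web query."""
    return FAKE_HITS.get(query, 0)

def instantiate(relation, x, y):
    """Fill the verb pair into every pattern of the relation."""
    return [pattern.format(x=x, y=y) for pattern in PATTERNS[relation]]

def assign_relation(relation, x, y, min_hits=5):
    """Assign the relation if the summed hits reach a threshold.

    min_hits is an arbitrary placeholder for a proper significance test.
    """
    total = sum(hit_count(query) for query in instantiate(relation, x, y))
    return total >= min_hits

print(instantiate("happens-before", "love", "marry"))
print(assign_relation("happens-before", "love", "marry"))
print(assign_relation("happens-before", "marry", "love"))
```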

23 References
Thanks to Sabine Schulte im Walde & Magdalena Wolska for some slides

Literature:
- McEnery & Wilson (1996): Corpus Linguistics. Edinburgh University Press.
- Chklovski & Pantel (2004): VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations. In Proceedings of EMNLP-04.
- Keller (2003): Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, Nr. 3.
- Baroni and Kilgarriff (2006): Large linguistically-processed Web Corpora for multiple languages. In Proceedings of EACL.
- Leech (1993): Corpus annotation schemes. Literary and Linguistic Computing 8(4).
- Lin and Pantel (2001): DIRT - Discovery of Inference Rules from Text. In Proceedings of KDD-01.
- Liu and Curran (2006): Web Text Corpus for Natural Language Processing. In Proceedings of EACL.

24 References
Some Corpora:
- Brown
- LOB
- BNC
- TIGER
- Penn Treebank
- Penn Discourse Treebank
- Prague Dependency Treebank