Data for linguistics ALEXIS DIMITRIADIS


1 Data for linguistics ALEXIS DIMITRIADIS

2 Text, corpora, and data in the wild

3 1. Where does language data come from?
The usual: introspection, questionnaires, etc.
Corpora, suited to the domain of study: written, spoken, child-directed, etc.
Databases with already-distilled information.
The usual: Google.

4 2. Searching for real examples
Using Google, we can check whether particular constructions are really impossible:
1. a. John saw a snake near him
   b. * John saw a snake near himself
Comparing Google hits for "near him" and "near himself" gives a ratio of 141 : 1, so the reflexive turns up about 0.7% as often. Rare, but not negligible! What are the hits like?
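The comparison above can be tried on any text we have at hand. The following is a minimal sketch, not the procedure used for the slide: the sample sentences are invented, and `count_phrase` is a hypothetical helper. The point it illustrates is that the search must respect word boundaries, otherwise every "near himself" is also counted as a "near him".

```python
import re

# Toy sample standing in for a page of search results (invented sentences).
SAMPLE = (
    "John saw a snake near him. "
    "She kept the dog near him all day. "
    "He gathered his followers near himself. "
    "The child stood near him."
)

def count_phrase(phrase: str, text: str) -> int:
    """Count whole-phrase matches, so 'near him' does not also match 'near himself'."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

near_him = count_phrase("near him", SAMPLE)
near_himself = count_phrase("near himself", SAMPLE)

# The slide's ratio of 141 : 1, expressed as a percentage of the 'near him' hits.
slide_percentage = 100 * 1 / 141

print(near_him, near_himself)
print(f"{slide_percentage:.1f}%")
```

A web search engine does not necessarily match phrases this strictly, so raw hit counts are best treated as rough estimates.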

5 Top hits for "near himself"
Archaic/biblical examples, but not only those.

7 3. Structured corpora
In the domain of Language Resources, high-quality corpora and other resources are created for various (usually computational) purposes.
Considerations:
   Selection of materials in the corpus: quality, balance.
   Clean-up, segmentation into sentences, metadata.
   Tagging, parsing, and other annotations.
Other language resources: parallel corpora, dictionaries, collocation lists, wordnets, ...
Corpora are easy to find on the web. Many cost money; some only require registration.

8 4. Some free corpora
Brown corpus (1961): 500 sources, categorized in diverse genres. One million words.
Project Gutenberg: contains 25,000 free electronic books.
Countless other specialized corpora: legal texts, medical texts, child language, L2 learners of English, ...
Child Language Data Exchange System (CHILDES)

9 5. The British National Corpus (BNC)
A 100-million-word collection from a wide range of sources, written and spoken. Designed to represent a wide cross-section of British English from the late 20th century.
Written part (90%): extracts from newspapers, periodicals for all ages and interests, academic books and popular fiction, published and unpublished letters, etc.
Spoken part (10%): transcriptions of unscripted informal conversations and spoken language, in contexts ranging from formal business or government meetings to radio shows and phone-ins.
Tagged (automatically) with part of speech.

10 6. Dutch corpora
From the Instituut voor Nederlandse Lexicologie (INL): 38 Miljoen Woorden Corpus (and many smaller collections)
Corpus Gesproken Nederlands (CGN)
Alpino treebank (parsed corpus): more than 150,000 words

11 7. Rolling our own
With so much data on the web, it's easy to collect as much data as a linguist could conceivably need.
Automatic annotation tools can help us search our data more easily.
Compiling and annotating a corpus does require a time investment.
An on-line tagger for Dutch text (when it works)
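As a rough illustration of the first processing steps for a home-made corpus, here is a sketch of sentence segmentation and word counting using only the Python standard library. The sample text and the helper names `split_sentences` and `tokenize` are assumptions made for this example; the rules are deliberately naive and would, for instance, split wrongly after abbreviations such as "Dr.".

```python
import re
from collections import Counter

# Invented sample text; in practice this would be collected material.
RAW = "John saw a snake near him. Did it move? It did not! He walked on."

def split_sentences(text: str) -> list[str]:
    """Naive segmentation: break after '.', '?' or '!' when followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence: str) -> list[str]:
    """Lowercased word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", sentence.lower())

sentences = split_sentences(RAW)
frequencies = Counter(tok for s in sentences for tok in tokenize(s))

print(len(sentences), "sentences")
print(frequencies.most_common(3))
```

Real annotation tools handle many more cases; the sketch only shows why clean-up and segmentation take time even before any tagging is done.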

12 8. The power of the web-crawling approach Online Database of Interlinear Text (ODIN)

13 Data and databases

14 9. Managing linguistic data
1. Keep it in Word documents
   Easy to get started; can store any kind of information.
   But: hard to count, sort, or get an overview of contents. Only one person at a time can edit the data.
2. Use a spreadsheet (Excel)
   Can store tabular information (only), sort, and calculate statistics. Simple queries.
   Limited options for display. Only suited for tabular data. One editor at a time.
3. Use a database
   Powerful: open-ended display, collaborative data entry, full queries.
   Complex to set up. Structure imposed on contents may be too restrictive.
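To give a feel for the database option, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table layout (`examples` with `language`, `sentence`, `judgment` columns) is an assumption for illustration, and the third example sentence is invented; the first two come from the "near him" slide.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE examples ("
    " id INTEGER PRIMARY KEY,"
    " language TEXT NOT NULL,"
    " sentence TEXT NOT NULL,"
    " judgment TEXT NOT NULL)"  # '' = acceptable, '*' = unacceptable
)

rows = [
    ("English", "John saw a snake near him", ""),
    ("English", "John saw a snake near himself", "*"),
    ("English", "John saw himself", ""),  # invented filler example
]
conn.executemany(
    "INSERT INTO examples (language, sentence, judgment) VALUES (?, ?, ?)", rows
)

# A query that would be awkward in a Word document: all starred sentences.
starred = conn.execute(
    "SELECT sentence FROM examples WHERE judgment = '*' ORDER BY id"
).fetchall()

# Count examples per judgment.
counts = dict(
    conn.execute("SELECT judgment, COUNT(*) FROM examples GROUP BY judgment").fetchall()
)

print(starred)
print(counts)
```

The fixed columns are also the drawback noted above: anything that does not fit the table layout has nowhere to go until the schema is changed.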

15 Managing linguistic data II
For messier data collections, we need more flexibility:
Keep the data in text files (not Word documents).
Search and manage the data as needed, using a variety of tools.
Python is a flexible programming language; the Natural Language Toolkit (NLTK) gives us numerous tools we can use to explore text.
It is still difficult for multiple people to work on the same collection, but not as difficult as with a single document.
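One way to explore plain text with Python is a keyword-in-context (KWIC) listing. The sketch below uses only the standard library; `kwic` is a hypothetical helper written for this example, and the sample text is invented. NLTK provides a comparable view through the `concordance` method of its `Text` class, which is not used here so that the example runs without extra installation.

```python
import re

def kwic(text: str, keyword: str, width: int = 20) -> list[str]:
    """Return keyword-in-context lines: up to `width` characters on each side of a match."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines

# Invented sample; in practice the string would be read from a text file.
TEXT = "He kept the letter near himself. John saw a snake near him, and the snake saw John."

for line in kwic(TEXT, "snake"):
    print(line)
```

Because the data is an ordinary string, the same function works on any text file, and the output can be sorted or filtered with further Python code.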

16 10. Notable cross-linguistic databases
Directory of the world's languages:
The World Atlas of Language Structures (WALS)
A collection of typological databases:
New server:
A different kind of collection: Online Database of Interlinear Text (ODIN)

17 More cross-linguistic databases
A simple, focused cross-linguistic survey: The Berlin intensifier database
Some more sophisticated examples: The Surrey databases
Our own reciprocals database:

18 Contents
1 Where does language data come from?
2 Searching for real examples
3 Structured corpora
4 Some free corpora
5 The British National Corpus (BNC)
6 Dutch corpora
7 Rolling our own
8 The power of the web-crawling approach
9 Managing linguistic data
10 Notable cross-linguistic databases
