Lab 12: Processing a Corpus. Ling 1330/2330: Computational Linguistics Na-Rae Han


1 Lab 12: Processing a Corpus Ling 1330/2330: Computational Linguistics Na-Rae Han

2 Objectives How to process a corpus 10/4/2018 2

3 Beyond a single, short text
So far, we have been handling relatively short texts, one at a time.
Going multiple
  Find out what's involved in processing a text archive of multiple text files (aka a corpus).
  Let's try this today.
Going big
  Find out what's involved in processing HUMONGOUS text files.


4 Processing multiple texts
From the NLTK Corpora page, download: C-Span Inaugural Address Corpus
The C-Span Inaugural Address Corpus
  Includes 56 past presidential inaugural addresses, from 1789 (Washington) to 2009 (Obama).
  The directory has 56 .txt files and one README file.
QUESTION: How do we effectively process this many files?

5 Corpus vs. sub-corpora
[Diagram labels: Sub-corpus 1, Sub-corpus 2, Entire Corpus]

6 Big token lists for sub-corpora
[Diagram: the texts of each sub-corpus feed one pooled list, labeled "sub-corpus 1 TOKENS" and "sub-corpus 2 TOKENS"]
Good when individual texts don't need separate attention.

7 Pools & individual token lists
[Diagram: each text has its own tokens list; the lists also feed the pools labeled "sub-corpus 1 TOKENS" and "sub-corpus 2 TOKENS"]
Individual token lists as well as sub-corpus pools.
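The idea of keeping individual token lists and also pooling them per sub-corpus can be sketched as follows. This is only an illustration: the token lists are made-up toy data, and `pool_by_century` is a hypothetical helper that treats the first two digits of the file name as the sub-corpus label.

```python
from collections import defaultdict

# Toy per-text token lists keyed by short file name (made-up content).
fn2toks = {
    "1789-Washington": ["we", "the", "people", "."],
    "1801-Jefferson": ["we", "stand", "together", "."],
    "1897-McKinley": ["we", "go", "on", "."],
    "2009-Obama": ["we", "begin", "again", "today", "."],
}

def pool_by_century(fn2toks):
    """Group individual token lists into one pooled list per century."""
    pools = defaultdict(list)
    for fn, toks in fn2toks.items():
        century = fn[:2] + "00s"      # '1801-Jefferson' -> '1800s'
        pools[century].extend(toks)   # add this text's tokens to its pool
    return dict(pools)

pools = pool_by_century(fn2toks)
for century, toks in pools.items():
    print(century, len(toks))
```

The individual lists in `fn2toks` stay available for per-text statistics, while `pools` serves questions asked of a whole sub-corpus.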

8 Using glob
glob: a file-name globbing utility. Returns a list of file names that match the specified pattern.
    >>> import glob
    >>> files = glob.glob(r'D:\Lab\inaugural\*.txt')    # all files ending in .txt; excludes README
    >>> len(files)
    56
    >>> files[:5]
    ['D:\\Lab\\inaugural\\1789-Washington.txt', 'D:\\Lab\\inaugural\\1793-Washington.txt', 'D:\\Lab\\inaugural\\1797-Adams.txt', 'D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt']
    >>> files[-1]
    'D:\\Lab\\inaugural\\2009-Obama.txt'
    >>>

9 Using glob
Addresses from the 1800s only
    >>> files2 = glob.glob(r'D:\Lab\inaugural\18*.txt')    # all files starting with '18' and ending with '.txt'
    >>> len(files2)
    25
    >>> files2[:5]
    ['D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt', 'D:\\Lab\\inaugural\\1809-Madison.txt', 'D:\\Lab\\inaugural\\1813-Madison.txt', 'D:\\Lab\\inaugural\\1817-Monroe.txt']
    >>> files2[-1]
    'D:\\Lab\\inaugural\\1897-McKinley.txt'
    >>>
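The two glob patterns above can be tried without the real corpus. The sketch below is an assumption-laden stand-in: it creates a temporary directory with a few empty, hypothetically named files, and `list_texts` is an illustrative helper, not part of the lab. It also sorts the result, since `glob.glob()` does not promise any particular order.

```python
import glob
import os
import tempfile

def list_texts(directory, pattern="*.txt"):
    """Return the paths in `directory` matching `pattern`, in sorted order."""
    # os.path.join builds the pattern with the right separator for this OS.
    return sorted(glob.glob(os.path.join(directory, pattern)))

# Build a tiny stand-in corpus directory with empty files.
demo_dir = tempfile.mkdtemp()
for name in ["1789-Washington.txt", "1801-Jefferson.txt",
             "1897-McKinley.txt", "2009-Obama.txt", "README"]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("")

all_files = list_texts(demo_dir)               # every .txt file; README is skipped
files_1800s = list_texts(demo_dir, "18*.txt")  # only names starting with '18'

print([os.path.basename(p) for p in all_files])
print([os.path.basename(p) for p in files_1800s])
```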

10 Build dictionary of texts
For-loop through file names and build a dictionary of key (file name): value (text content).
The full name is too long. How do we extract the short name?
    >>> files[0]
    'D:\\Lab\\inaugural\\1789-Washington.txt'
    >>> files[0][12:-4]
    'ural\\1789-Washington'
    >>> files[0][17:-4]                         # gets the job done
    '1789-Washington'
    >>> files[2][17:-4]
    '1797-Adams'
    >>> files[0].index('\\')
    2
    >>> files[0].rindex('\\')                   # highest index of '\' (Windows dir separator)
    16
    >>> files[0][files[0].rindex('\\')+1:-4]    # the more principled way of extracting the short file name
    '1789-Washington'
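As an alternative to slicing with `rindex('\\')`, the standard library can strip the directory and the extension for us. The sketch below uses `ntpath` and `PureWindowsPath` so that the Windows-style path from the slide is parsed the same way on any operating system; on your own machine, `os.path` and `pathlib.Path` would be the usual choices.

```python
import ntpath
from pathlib import PureWindowsPath

longname = 'D:\\Lab\\inaugural\\1789-Washington.txt'

# Option 1: ntpath is the Windows flavor of os.path and works on any OS.
# basename() drops the directory part; splitext() splits off the extension.
short1 = ntpath.splitext(ntpath.basename(longname))[0]

# Option 2: .stem is the final path component without its suffix.
short2 = PureWindowsPath(longname).stem

print(short1, short2)
```

Unlike the fixed `-4` slice, both options keep working if the extension is not exactly four characters long.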

11 Build dictionary of texts
For-loop through file names and build a dictionary of key (file name): value (text content).
    >>> files[0][files[0].rindex('\\')+1:-4]
    '1789-Washington'
    >>> fn2txt = {}
    >>> for longname in files:
            f = open(longname)
            txt = f.read()
            f.close()
            start = longname.rindex('\\')+1
            short = longname[start:-4]
            fn2txt[short] = txt

    >>> fn2txt['1809-Madison'][:40]
    'Unwilling to depart from examples of the'
    >>> fn2txt['1789-Washington'][:40]
    'Fellow-Citizens of the Senate and of the'
fn2txt: file name as key, text string as value
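The same dictionary can be built with `pathlib`, which handles the directory separator and closes each file for us. This is a minimal sketch under stated assumptions: `load_texts` is a hypothetical helper, the UTF-8 encoding is a guess about the files, and the demo directory holds made-up stand-in text rather than the real speeches.

```python
import tempfile
from pathlib import Path

def load_texts(directory, pattern="*.txt", encoding="utf-8"):
    """Map each short file name (no directory, no extension) to its text."""
    fn2txt = {}
    for path in sorted(Path(directory).glob(pattern)):
        # read_text() opens, reads, and closes the file in one call.
        fn2txt[path.stem] = path.read_text(encoding=encoding)
    return fn2txt

# Stand-in corpus directory with made-up content.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "1789-Washington.txt").write_text("First toy speech.", encoding="utf-8")
(demo_dir / "1809-Madison.txt").write_text("Second toy speech.", encoding="utf-8")
(demo_dir / "README").write_text("not a speech", encoding="utf-8")

fn2txt = load_texts(demo_dir)
print(sorted(fn2txt))
print(fn2txt["1809-Madison"][:6])
```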

12 Processing each text
Task: Compute the average sentence length for each presidential address.
We have to build separate token lists for each speech.
    >>> import textstats
    >>> fn2toks = {}
    >>> for (fn, txt) in fn2txt.items():
            toks = textstats.gettokens(txt)
            fn2toks[fn] = toks

    >>> len(fn2toks)
    56
    >>> fn2toks['1789-Washington']
    ['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':',...
    >>> fn2toks['2001-Bush'][:10]
    ['president', 'clinton', ',', 'distinguished', 'guests', 'and', 'my', 'fellow', 'citizens', ',']
fn2toks: file name as key, token list as value

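13 Speech length, 'peace' count
    >>> for fn in fn2toks:
            toks = fn2toks[fn]
            print(len(toks), fn)

    [output, one line per address: token count and file name; only the names Washington, Washington, ... Obama survive in this copy]
    >>> for fn in fn2toks:
            toks = fn2toks[fn]
            print(toks.count('peace'), '\t', fn)

    [output, one line per address: count of 'peace' and file name; only the names Eisenhower, Eisenhower, Kennedy, Johnson, Nixon, Nixon, Carter survive in this copy]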
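Both per-address figures can also be collected in one pass and printed as aligned columns. The sketch below runs on made-up toy token lists, so its numbers say nothing about the real speeches, and `summarize` is a hypothetical helper name.

```python
# Toy token lists keyed by file name (made-up content).
fn2toks = {
    "1953-Eisenhower": ["peace", "and", "more", "peace", "."],
    "1961-Kennedy": ["let", "us", "begin", "."],
    "1977-Carter": ["a", "lasting", "peace", "."],
}

def summarize(fn2toks, word):
    """Map each file name to (token count, occurrences of `word`)."""
    return {fn: (len(toks), toks.count(word)) for fn, toks in fn2toks.items()}

summary = summarize(fn2toks, "peace")
for fn, (n_toks, n_word) in sorted(summary.items()):
    # Right-align the two numbers so the columns line up.
    print(f"{n_toks:>6} {n_word:>4}  {fn}")
```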

14 Average sentence length, per address
    >>> for fn in fn2toks:
            toks = fn2toks[fn]
            sentcount = toks.count('.') + toks.count('!') \
                        + toks.count('?')
            avgsentlen = len(toks)/sentcount
            print(avgsentlen, '\t', fn)

    [output, one line per address: average sentence length and file name; only the names Washington, Washington, Adams, Jefferson, Jefferson, Madison, ... Bush, Bush, Obama survive in this copy]
    >>>
Assumes every sentence ends with '.', '!', or '?'
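The sentence-length computation can be wrapped in a function that also copes with a text containing no sentence-final punctuation, where the division above would raise `ZeroDivisionError`. This is a sketch, not the lab's code: `simple_tokens` is a hypothetical regex tokenizer standing in for `textstats.gettokens`, which is not reproduced here.

```python
import re

def simple_tokens(text):
    """Hypothetical stand-in tokenizer: lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def avg_sentence_length(toks):
    """Tokens per sentence, assuming sentences end with '.', '!' or '?'."""
    sentcount = sum(toks.count(p) for p in ".!?")
    if sentcount == 0:
        return None   # no sentence boundary found; avoid dividing by zero
    return len(toks) / sentcount

toks = simple_tokens("We meet today. We part tomorrow! Why?")
print(toks)
print(avg_sentence_length(toks))
```

Note that the punctuation tokens themselves are included in `len(toks)`, just as in the loop above, so the average counts them as part of each sentence.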

15 Treating files as a single corpus
Task: Compile word frequency of the Inaugural Speeches.
For this, we only need to build a single pool of tokenized words.
For each text, tokenize it, and then add the result to the pool of tokenized words.
    >>> import textstats
    >>> alltoks = []
    >>> for txt in fn2txt.values():
            toks = textstats.gettokens(txt)
            alltoks.extend(toks)

    >>> len(alltoks)
    >>> alltoks[:15]
    ['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']
    >>> alltoks[-15:]
    ['you', '.', 'god', 'bless', 'you', '.', 'and', 'god', 'bless', 'the', 'united', 'states', 'of', 'america', '.']

16 Word frequency of entire corpus
    >>> allfreq = textstats.getfreq(alltoks)
    >>> allfreq['citizens']
    237
    >>> allfreq['battle']
    12
    >>> for k in sorted(allfreq, key=allfreq.get, reverse=True)[:10]:
            print(k, allfreq[k])

    the 9906
    of 6986
    , 6862
    and
    to 4432
    in 2749
    a 2193
    our 2058
    that 1726
    [the count for 'and' and one of the ten entries did not survive in this copy]
    >>>
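A frequency table like this can also be built with `collections.Counter` from the standard library. The sketch below uses it as a stand-in for `textstats.getfreq`, whose implementation is not shown here, and runs on a made-up toy token list.

```python
from collections import Counter

# Toy token pool (made-up content).
alltoks = ["the", "people", "of", "the", "nation", ",", "the", "people", "."]

allfreq = Counter(alltoks)       # maps each token to its count

print(allfreq["the"])            # count of a token that occurs
print(allfreq["battle"])         # a Counter gives 0 for unseen tokens, not KeyError

# most_common(n) returns the n most frequent (token, count) pairs.
for word, count in allfreq.most_common(2):
    print(word, count)

# Equivalent to the sorted(...) idiom; note the capitalized True.
top2 = sorted(allfreq, key=allfreq.get, reverse=True)[:2]
print(top2)
```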

17 Treating files as a single corpus, take 2
Task: Compile word frequency of the Inaugural Speeches.
Alternative approach: join all text strings into a single gigantic text string, and then tokenize it all at once.
    >>> alltxt = '\n'.join(fn2txt.values())    # all speech texts, concatenated with a line break in between
    >>> alltoks = textstats.gettokens(alltxt)
    >>> len(alltoks)
    >>> alltoks[:15]
    ['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']
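The two pooling strategies should give the same token list as long as the tokenizer never builds a token across a line break. The sketch below checks this on made-up toy texts, with a hypothetical regex tokenizer `simple_tokens` standing in for `textstats.gettokens`; it also shows why the texts are joined with '\n' rather than with an empty string.

```python
import re

def simple_tokens(text):
    """Hypothetical stand-in tokenizer: lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Toy texts keyed by file name (made-up content).
fn2txt = {"first": "We begin.", "second": "We continue!"}

# Approach 1: tokenize each text, then pool the tokens.
pooled = []
for txt in fn2txt.values():
    pooled.extend(simple_tokens(txt))

# Approach 2: join the texts with a line break, then tokenize once.
joined = simple_tokens("\n".join(fn2txt.values()))

print(pooled == joined)

# Joining with '' instead can fuse the last word of one text
# with the first word of the next.
fused = simple_tokens("".join(["the end", "start here"]))
print(fused)
```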
