CIS 4930 NLP, Exam III, March 17, 2010

Print Your Name: ____________________    Total Score: ______

Your work is to be done individually. The exam is worth 106 points (six points of extra credit are available throughout the exam) and it has twelve questions. Unless a problem directly instructs you differently, there are no known errors within this document. If you are instructed to use specific functionality to solve a problem, then follow the guidelines given. Otherwise, you are allowed to use anything from the Python modules, provided you include all statements needed to access such functionality.

Here is the simplified Brown Tag Set for your reference. Unless otherwise specified, all corpora will be tagged using the definitions of this set.

Tag  Meaning             Tag  Meaning          Tag  Meaning        Tag  Meaning
ADJ  Adjective           ADV  Adverb           CNJ  Conjunction    DET  Determiner
EX   Existential         FW   Foreign Word     MOD  Modal Verb     N    Noun
NP   Proper Noun         NUM  Number           PRO  Pronoun        P    Preposition
TO   The Word to         UH   Interjection     V    Verb           VD   Past Tense
VG   Present Participle  VN   Past Participle  WH   wh Determiner

1. [6 pts] Define and describe the following parts of speech.

   (a) Noun
       A person, place, or thing.

   (b) Past Participle
       The form of a verb used to make perfect tenses and passive forms; the verb form that follows some form of the verb "to have".

2. [5 pts] Define and describe Bayes' Rule.

   A theorem for finding the probability of a fact A being true given that fact B is true: P(A|B) = P(B|A) P(A) / P(B).

3. [5 pts] Define and describe the Null Hypothesis Test.

   The technique of setting up a hypothesis to be nullified or refuted in order to support an alternative hypothesis.
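As a small numeric illustration of Bayes' Rule from Question 2, the sketch below computes P(A|B) from P(B|A), P(A), and P(B|not A). The function name `bayes_posterior` and the probabilities are hypothetical, chosen only to make the arithmetic easy to follow; they do not come from the exam.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) given P(B|A), P(A), and P(B|not A)."""
    # The law of total probability supplies the evidence term P(B).
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    # Bayes' Rule: P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b


# Hypothetical numbers: A = "the token is a noun", B = "the token ends in -s".
posterior = bayes_posterior(p_b_given_a=0.30, p_a=0.25, p_b_given_not_a=0.10)
print(round(posterior, 3))
```

With these made-up inputs the two terms of P(B) are equal, so observing B raises the probability of A from 0.25 to about one half.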
4. [6 pts] State the formula for Pearson's Chi Square Test.

   $\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $O_{ij}$ is the observed count in cell $(i, j)$ of the table and $E_{ij}$ is the count expected if the two events were independent.

5. [6 pts] Using the values: a total of 5,000 tokens on the course schedule page, CIS occurring 48 times, 4930 occurring 11 times, and CIS 4930 occurring 10 times, create the table of data used by Pearson's Chi Square Test.

            CIS    !CIS
   4930      10       1
   !4930     38    4950

6. [6 pts] We would like to calculate the mean differential between the tokens 4930 and 4905 when each is preceded by CIS. In the Spring 2010 course schedule, CIS occurs 48 times, CIS 4930 occurs 10 times, and CIS 4905 occurs 1 time. Resolve your calculation as much as you can by hand; you may leave your result in fractional form.

   $\frac{C(w^1 w) - C(w^2 w)}{\sqrt{C(w^1 w) + C(w^2 w)}} = \frac{10 - 1}{\sqrt{10 + 1}} = \frac{9}{\sqrt{11}}$
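The arithmetic in Questions 5 and 6 can be checked numerically. The following is a minimal sketch, assuming the 2x2 table is used exactly as printed in Question 5 (including the 4950 cell); the helper names `chi_square` and `count_difference_score` are illustrative and not part of the exam.

```python
from math import sqrt


def chi_square(table):
    """Pearson's chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = float(sum(row_totals))
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column events were independent.
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total


def count_difference_score(c1, c2):
    """Difference of two counts scaled by the square root of their sum."""
    return (c1 - c2) / sqrt(c1 + c2)


# Rows: 4930, !4930. Columns: CIS, !CIS (values as printed in Question 5).
table = [[10, 1], [38, 4950]]
chi2 = chi_square(table)

# Question 6: C(CIS 4930) = 10 and C(CIS 4905) = 1.
score = count_difference_score(10, 1)
print(chi2, score)
```

The second value printed equals 9/sqrt(11), the fractional result left in the answer to Question 6.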
7. [8 pts] A tagger exists within the file Tagger.pkl. Show how to read this tagger into your program for re-use.

    from cPickle import load
    input = open('Tagger.pkl', 'rb')
    tagger = load(input)
    input.close()

8. [8 pts] You are given a list of tagged sentences called training. Show how to create a bigram tagger using this set of training data and the tagger you read in from the prior question as your backoff tagger.

    import nltk
    t1 = nltk.BigramTagger(training, backoff=tagger)

9. [4 pts] Given a list of untagged data called data, show how to tag this data using the tagger you created in the prior question.

    t1.tag(data)
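The flow of Questions 7 through 9 (load a saved tagger, train a bigram tagger that backs off to it, then tag new tokens) can be sketched without NLTK. The code below is a standard-library stand-in written in Python 3, not NLTK's implementation: the "taggers" are plain dictionaries, and the function names, the training sentences, and the default tag 'N' are all assumptions made for illustration.

```python
import os
import pickle
import tempfile
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def train_bigram(tagged_sents):
    """Map (previous tag, word) to the most frequent tag in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # no previous tag at the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def tag(tokens, bigram_model, backoff_model, default='N'):
    """Tag tokens with the bigram model, backing off when a context is unseen."""
    tagged = []
    prev = None
    for word in tokens:
        best = bigram_model.get((prev, word))
        if best is None:
            best = backoff_model.get(word, default)
        tagged.append((word, best))
        prev = best
    return tagged


# Save a small backoff model and read it back, mirroring Question 7.
backoff = train_unigram([[('the', 'DET'), ('glass', 'N'), ('landed', 'VD')]])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Tagger.pkl')
    with open(path, 'wb') as out:
        pickle.dump(backoff, out)
    with open(path, 'rb') as infile:
        loaded = pickle.load(infile)

# Train the bigram model and tag new data, mirroring Questions 8 and 9.
training = [
    [('the', 'DET'), ('drink', 'N'), ('spilled', 'VD')],
    [('they', 'PRO'), ('drink', 'V'), ('water', 'N')],
]
bigram = train_bigram(training)
print(tag(['the', 'drink', 'landed'], bigram, loaded))
print(tag(['they', 'drink', 'water'], bigram, loaded))
```

In this toy data the word "drink" receives a different tag depending on the tag before it, and "landed", which the bigram model never saw, is resolved by the loaded backoff model.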
10. [16 pts] Create a method that will receive a tagged corpus and a specific tag. The method will search the corpus for the tag that most commonly follows the tag received. Return a list composed of: the specified tag, the most commonly following tag, and the frequency with which the most common tag follows the specified tag.

    def findmostcommonfollower(tagged_corpus, specified):
        from collections import defaultdict
        words = tagged_corpus.tagged_words(simplify_tags=True)
        followers = defaultdict(int)
        length = len(words)
        totalcount = 0
        # Count every occurrence of the specified tag and the tag after it.
        for i in range(length):
            if words[i][1] == specified:
                totalcount += 1
                if i != length - 1:
                    followers[words[i + 1][1]] += 1
        # Find the follower tag with the highest count.
        maxcount = 0
        maxtag = None
        for each in followers.keys():
            nextcount = followers[each]
            if nextcount > maxcount:
                maxcount = nextcount
                maxtag = each
        # Assumes the specified tag occurs at least once in the corpus.
        return [specified, maxtag, (maxcount + 0.0) / totalcount]
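To try the idea from Question 10 without a corpus object, here is a minimal Python 3 sketch that takes a plain list of (word, tag) pairs instead. The name `most_common_follower`, the hand-tagged sample, and the choice to return None with 0.0 when the tag never has a follower are all assumptions for illustration, not the exam's required interface.

```python
from collections import Counter


def most_common_follower(tagged_words, specified):
    """Return [specified, most common following tag, relative frequency]."""
    tags = [tag for _, tag in tagged_words]
    total = tags.count(specified)
    # Pair each tag with its successor and keep successors of the given tag.
    followers = Counter(
        nxt for cur, nxt in zip(tags, tags[1:]) if cur == specified
    )
    if not followers:
        return [specified, None, 0.0]
    best_tag, best_count = followers.most_common(1)[0]
    return [specified, best_tag, best_count / total]


# Hand-tagged sample using the simplified tag set (hypothetical data).
sample = [
    ('the', 'DET'), ('drink', 'N'), ('spilled', 'VD'), ('from', 'P'),
    ('my', 'PRO'), ('glass', 'N'), ('and', 'CNJ'), ('landed', 'VD'),
    ('on', 'P'), ('my', 'PRO'), ('new', 'ADJ'), ('shoes', 'N'),
    ('the', 'DET'), ('glass', 'N'), ('broke', 'VD'),
]
print(most_common_follower(sample, 'N'))
print(most_common_follower(sample, 'P'))
```

In the sample, N occurs four times and is followed by VD twice, so the first call reports VD with a frequency of 0.5.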
11. [20 pts] Prepositional phrases are made up of a preposition and some set of following words, and are ended with a noun (the object of the preposition). Create a method that will receive a tagged corpus. The method will search the corpus for all prepositions and return a list of tuples (or sub-lists) containing: the preposition, the entire prepositional phrase (including the preposition), and the number of tokens (words) within the prepositional phrase. Consider: "the drink spilled from my glass and landed on my new shoes". Your method will return: [('from', 'from my glass', 3), ('on', 'on my new shoes', 4)].

    def findprepositions(tagged_corpus):
        words = tagged_corpus.tagged_words(simplify_tags=True)
        results = []
        lastprep = None
        count = 0
        phrase = ''
        for each in words:
            if each[1] == 'P':
                # A preposition starts a new phrase.
                lastprep = each[0]
                count = 1
                phrase = lastprep
            elif lastprep is not None:
                # Inside a phrase: extend it until a noun closes it.
                phrase += ' ' + each[0]
                count += 1
                if each[1] == 'N':
                    results.append([lastprep, phrase, count])
                    lastprep = None
        return results
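The example sentence from Question 11 can be run through the same scan without a corpus object. This is a minimal Python 3 sketch that accepts a list of (word, tag) pairs; the function name `find_prepositional_phrases` and the hand-assigned tags for the sentence are assumptions made for illustration.

```python
def find_prepositional_phrases(tagged_words):
    """Collect (preposition, phrase, token count) for each P ... N span."""
    results = []
    phrase = None  # tokens of the phrase in progress, or None if not in one
    for word, tag in tagged_words:
        if tag == 'P':
            # A preposition opens a new phrase (and abandons an unfinished one).
            phrase = [word]
        elif phrase is not None:
            phrase.append(word)
            if tag == 'N':
                # The first noun closes the phrase.
                results.append((phrase[0], ' '.join(phrase), len(phrase)))
                phrase = None
    return results


# The exam's example sentence, tagged by hand with the simplified tag set.
sentence = [
    ('the', 'DET'), ('drink', 'N'), ('spilled', 'VD'), ('from', 'P'),
    ('my', 'PRO'), ('glass', 'N'), ('and', 'CNJ'), ('landed', 'VD'),
    ('on', 'P'), ('my', 'PRO'), ('new', 'ADJ'), ('shoes', 'N'),
]
print(find_prepositional_phrases(sentence))
# prints [('from', 'from my glass', 3), ('on', 'on my new shoes', 4)]
```

Like the answer above, this sketch ends a phrase at the first noun, so a compound object such as "coffee cup" would be cut after "coffee".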
12. [16 pts] The feminine pronouns are: she, her, herself, and hers, and the masculine pronouns are: he, him, himself, and his. Create a method that will receive a tagged corpus and return the ratio of feminine pronouns to masculine pronouns.

    def findgenderratio(tagged_corpus):
        words = tagged_corpus.tagged_words(simplify_tags=True)
        feminine = 0
        masculine = 0
        for each in words:
            if each[1] == 'PRO':
                if each[0] == 'she' or each[0] == 'her' or each[0] == 'herself' or each[0] == 'hers':
                    feminine += 1
                elif each[0] == 'he' or each[0] == 'him' or each[0] == 'himself' or each[0] == 'his':
                    masculine += 1
        # Assumes at least one masculine pronoun occurs in the corpus.
        return (feminine + 0.0) / masculine
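A variant of the Question 12 count can be sketched on a plain list of (word, tag) pairs. Two choices here are assumptions that go beyond the answer above: words are lower-cased before matching, so a sentence-initial "She" is counted, and None is returned when no masculine pronoun is found rather than dividing by zero. The function name and sample data are illustrative only.

```python
FEMININE = {'she', 'her', 'herself', 'hers'}
MASCULINE = {'he', 'him', 'himself', 'his'}


def gender_ratio(tagged_words):
    """Return feminine/masculine pronoun ratio, or None if undefined."""
    feminine = 0
    masculine = 0
    for word, tag in tagged_words:
        if tag != 'PRO':
            continue
        lowered = word.lower()  # assumption: match case-insensitively
        if lowered in FEMININE:
            feminine += 1
        elif lowered in MASCULINE:
            masculine += 1
    if masculine == 0:
        return None  # ratio is undefined without masculine pronouns
    return feminine / masculine


# Hand-tagged sample using the simplified tag set (hypothetical data).
sample = [
    ('She', 'PRO'), ('gave', 'VD'), ('him', 'PRO'), ('his', 'PRO'),
    ('book', 'N'), ('and', 'CNJ'), ('her', 'PRO'), ('pen', 'N'),
    ('he', 'PRO'), ('left', 'VD'),
]
ratio = gender_ratio(sample)
print(ratio)
```

The sample contains two feminine and three masculine pronouns, so the ratio is two thirds.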