Text Generator Due Sunday, September 17, 2017

Similar documents
CS451 - Assignment 3 Perceptron Learning Algorithm

Binary Trees Due Sunday March 16, 2014

CS159 - Assignment 2b

CS 134 Programming Exercise 2:

Java Program Structure and Eclipse. Overview. Eclipse Projects and Project Structure. COMP 210: Object-Oriented Programming Lecture Notes 1

CS 051 Homework Laboratory #2

Lab 1: Silver Dollar Game 1 CSCI 2101B Fall 2018

CSE 143: Computer Programming II Summer 2017 HW4: Grammar (due Tuesday, July :30pm)

CMPSCI 187 / Spring 2015 Hangman

CMPSCI 187 / Spring 2015 Sorting Kata

Documentation Requirements Computer Science 2334 Spring 2016

CMPSCI 187 / Spring 2015 Hanoi

King Abdulaziz University Faculty of Computing and Information Technology Computer Science Department

ArrayList. Introduction. java.util.arraylist

Project 3: Implementing a List Map

CSCE 156 Computer Science II

CS451 - Assignment 8 Faster Naive Bayes? Say it ain t so...

Lab 0 Introduction to the MSP430F5529 Launchpad-based Lab Board and Code Composer Studio

CS201 - Assignment 7 Due: Wednesday April 16, at the beginning of class

Due: 9 February 2017 at 1159pm (2359, Pacific Standard Time)

CS161: Introduction to Computer Science Homework Assignment 10 Due: Monday 11/28 by 11:59pm

Project 8: Virtual Machine Translator II

Homework 2: Imperative Due: 5:00 PM, Feb 15, 2019

Family Map Server Specification

Assignment #1: /Survey and Karel the Robot Karel problems due: 1:30pm on Friday, October 7th

Mehran Sahami Handout #7 CS 106A September 24, 2014

C152 Laboratory Exercise 1

Assignment #1 Simple C++

Tips from the experts: How to waste a lot of time on this assignment

CSE 143: Computer Programming II Summer 2017 HW5: Anagrams (due Thursday, August 3, :30pm)

Deliverables. Problem Description

Table of Laplace Transforms

CMPSCI 187 / Spring 2015 Implementing Sets Using Linked Lists

Assignment #1: and Karel the Robot Karel problems due: 3:15pm on Friday, October 4th due: 11:59pm on Sunday, October 6th

King Abdulaziz University Faculty of Computing and Information Technology Computer Science Department

Prerequisites for Eclipse

Jerry Cain Handout #5 CS 106AJ September 30, Using JSKarel

Assignment #2: Intro to Java Due: 11AM PST on Wednesday, July 12

Tempo Client Gateway Course Handout by the Houston Association of Realtors

CS211 Computers and Programming Matthew Harris and Alexa Sharp July 9, Boggle

Compuer Science 62 Assignment 10

CS112 Lecture: Defining Classes. 1. To describe the process of defining an instantiable class

Lab 4: Super Sudoku Solver CSCI 2101 Fall 2017

************ THIS PROGRAM IS NOT ELIGIBLE FOR LATE SUBMISSION. ALL SUBMISSIONS MUST BE RECEIVED BY THE DUE DATE/TIME INDICATED ABOVE HERE

CSE143X: Computer Programming I & II Programming Assignment #9 due: Monday, 11/27/17, 11:00 pm

CSE 332: Data Structures and Parallelism Winter 2019 Setting Up Your CSE 332 Environment

Array Based Lists. Collections

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

CS/IT 114 Introduction to Java, Part 1 FALL 2016 CLASS 3: SEP. 13TH INSTRUCTOR: JIAYIN WANG

King Abdulaziz University Faculty of Computing and Information Technology Computer Science Department

Tips from the experts: How to waste a lot of time on this assignment

Keep Track of Your Passwords Easily

EVERY NATION OUTLOOK WEB ACCESS (OWA) USER S GUIDE

Introduction to Programming Style

Repetition Structures

Condition-Controlled Loop. Condition-Controlled Loop. If Statement. Various Forms. Conditional-Controlled Loop. Loop Caution.

: Principles of Imperative Computation. Fall Assignment 5: Interfaces, Backtracking Search, Hash Tables

Practical 2: Using Minitab (not assessed, for practice only!)

ECE2049 Embedded Computing in Engineering Design. Lab #0 Introduction to the MSP430F5529 Launchpad-based Lab Board and Code Composer Studio

Lab 1 Introduction to R

Problem 1: Building and testing your own linked indexed list

There are several files including the start of a unit test and the method stubs in MindNumber.java. Here is a preview of what you will do:

CSE 143: Computer Programming II Winter 2019 HW6: AnagramSolver (due Thursday, Feb 28, :30pm)

Introduction to Advanced Features of PowerPoint 2010

Mehran Sahami Handout #5 CS 106A September 27, 2017 Downloading Eclipse

Computer Science E-119 Fall Problem Set 3. Due before lecture on Wednesday, October 31

CSE 143: Computer Programming II Spring 2015 HW7: 20 Questions (due Thursday, May 28, :30pm)

CMPSCI 187 / Spring 2015 Postfix Expression Evaluator

Option 1: Syllabus home page

This homework has an opportunity for substantial extra credit, which is described at the end of this document.

CMSC 201 Spring 2017 Project 1 Number Classifier

King Abdulaziz University Faculty of Computing and Information Technology Computer Science Department

Using Eclipse for Java. Using Eclipse for Java 1 / 1

Java/RealJ Troubleshooting Guide

Mehran Sahami Handout #5 CS 106A September 26, 2018 Downloading Eclipse

Initial Coding Guidelines

CS158 - Assignment 9 Faster Naive Bayes? Say it ain t so...

6.170 Laboratory in Software Engineering Java Style Guide. Overview. Descriptive names. Consistent indentation and spacing. Page 1 of 5.

Regis University CC&IS CS310 Data Structures Programming Assignment 2: Arrays and ArrayLists

CS1004: Intro to CS in Java, Spring 2005

Soda Machine Laboratory

be able to read, understand, and modify a program written by someone else utilize the Java Swing classes to implement a GUI

Using Eclipse Europa - A Tutorial

Chapter 29B: More about Strings Bradley Kjell (Revised 06/12/2008)

Jim Lambers ENERGY 211 / CME 211 Autumn Quarter Programming Project 2

Project 6: Assembler

CMSC 201 Spring 2018 Project 2 Battleship

Programming Project 5: NYPD Motor Vehicle Collisions Analysis

Stage 11 Array Practice With. Zip Code Encoding

CEN5501 Computer Networks Project Description

CS 201 Advanced Object-Oriented Programming Lab 6 - Sudoku, Part 2 Due: March 10/11, 11:30 PM

Hello World on the ATLYS Board. Building the Hardware

CS61BL: Data Structures & Programming Methodology Summer Project 1: Dots!

Note: This is a miniassignment and the grading is automated. If you do not submit it correctly, you will receive at most half credit.

Handout 4: Version Control Reference

Regis University CC&IS CS362 Data Structures

Lab 4: On Lists and Light Sabers

Project #3. Computer Science 2334 Fall Create a Graphical User Interface and import/export software for a Hurricane Database.

ADOBE DRIVE 4.2 USER GUIDE

CS 332, Spring 2016, Norman April 21, Lab: Router Lab

Transcription:

Text Generator Due Sunday, September 17, 2017 Objectives For this assignment, you will: Gain practice using Java generics Gain practice using the ArrayList and Association classes Gain proficiency in using objects to build more complex data structures Description In this assignment, we will use a basic technique in Artificial Intelligence to automatically generate text. You will write a program that reads in a piece of text and then uses that text as a basis to generate new text. The method for generating new text uses simple probability. First, we read in a piece of text word by word (we consider punctuation symbols to be words) keeping track of how often each three-word sequence (trigram) appears. For example, consider the following excerpt from Rudyard Kipling s poem If : If you can keep your head when all about you Are losing theirs and blaming it on you, If you can trust yourself when all men doubt you, But make allowance for their doubting too; If you can wait and not be tired by waiting In this excerpt, the trigrams are: if you can, you can keep, can keep your, keep your head, your head when, head when all, etc. Once we have counted all of the trigrams, we can compute the probability that a word w 3 will immediately follow two other words (w 1 w 2 ) using the following equation: p(w 3 w 1, w 2 ) = n 123 n 12 where n 123 is the number of times we observed the sequence w 1 w 2 w 3 and n 12 is the number of times we observed the sequence w 1 w 2. For example, let s compute the probability that the word can will immediately follow the words if you. In the excerpt above, the sequence if you can occurs three times. The sequence if you also occurs three times. So the probability is given by, p(can if you) = 3/3 = 1 The probability that any other word comes immediately after if you is 0. Consider another example. In the excerpt above, the words you can appear 3 times. The first time followed by keep, the second time by trust and the last time by wait. Thus, 1

p(keep you can) = 1/3 p(trust you can) = 1/3 p(wait you can) = 1/3 Again, the probability that any other word besides keep, trust, or wait appears after you can is zero. Once we have the text processed, and stored in a data structure that allows us to compute probabilities, we can generate new text. We first pick two words to start with (for example, the first two words in the input text). Given these two words, and the probabilities computed above, we use a random number generator to choose the next word. Continue this process until the new text is 400 words long (including punctuation). When printing the new text, please generate a new line after every 20 words so that you don t just generate one very long, unreadable, line. Note: There may be two words that were seen in the text but were never followed by any word. For example, in the excerpt of the Kipling poem above, the words by waiting occur once but are never followed by a word (since they are the last two words in the excerpt). In this case, you can either stop generating words or feel free to think up a more creative solution. Warning: This assignment is significantly harder than the last one and will take a lot of time. Please start early to save yourself lots of grief! Classes FreqList class We suggest that you write the FreqList class first. FreqList should contain an array list of Associations. ArrayList is part of the java.util package with documentation found in the Java 8 Javadoc pages linked to from the handout page on the course website. The Association class is part of Bailey s structure5 package. Javadoc for this class is also available from a different link on the handout page. An Association has both a key and a value. The key is a word and the value is the number of times the word occurs. When a word is added, if it already occurs in the array list then its value (i.e. its frequency) is incremented by 1. If it doesn t exist, add the word to the array list with a value (i.e. frequency) of 1. Your FreqList class should have an instance variable that keep tracks of the number of words added. This is equivalent to the sum of all the values (i.e. frequencies) of all the Associations in the array list. The FreqList class should also have a method with the following definition: public String get(double p) This method takes a double p as an input and returns a word from the array list. The input p must be between 0.0 and 1.0, otherwise the method should throw an IllegalArgumentException. How can we use p to generate a word? In our example above, the FreqList for the words you can would look like: {< keep, 1>, < trust, 1>, < wait, 1>}. The sum of all of the frequencies is 3. Thus, we will return keep whenever 0 p < 1/3, trust whenever 1/3 p < 2/3 and wait whenever 2/3 p 1. If the frequency list is empty, this method should return an empty string. 2

Write this class and test it thoroughly by adding a main method to make sure all methods work correctly. We suggest printing out a representation of the frequency list first to make sure that the table is correct before attempting to write or test the probabilistic get method. TextGenerator class The TextGenerator class should also contain an array list of Associations. Again, an Association has a key and a value. In this case, the key is a sequence of two words (e.g., you can ). The value is an object of type FreqList. We have included a class StringPair that can be used to represent a sequence of two words. Thus, the array list in TextGenerator should have interface type List<Association<StringPair, FreqList>>}. We have provided you with startup code in the main method of class TextGenerator that will pop up a dialog box to allow the user to choose a file containing the input text. We have also provided the headers of two methods for that class. The first method, enter(s1, s2, s3), will be used to build the table. The three String parameters are used to build the table (list) that will be used later to generate random text. Suppose the table you are building is named table and the input starts with This is the time for. Then you will call table.enter("this", "is", "the"), which should update the entry for the word pair of This and is to record that the is a possible word to follow the pair. More carefully, if there is no entry for that word pair, then it will create one for that pair with an empty frequency list, and then add the occurring once to that frequency list. If the word pair is already there, update the frequency list to record the added occurrence of the. Having done this, continue with table.enter("is","the","time"), then table.enter("the","time","for"), and continuing in the same way if there is more text. We have included some text files (ending with suffix txt ) in the assignment folder that you can use to test your program. We ve tried to pick files with sufficient repetition of triples. After the input has been processed to build the table, you should generate new text. You may start with a fixed pair of words or choose two words randomly. The method getnextword(s1,s2) uses the table generated from the input to return a randomly chosen word from the frequency list associated with the word pair of s1 and s2. (Of course it should choose the word using the probabilities as discussed earlier.) Generate and print a string of at least 400 words so that we can see how your program works. When printing the text, please generate a new line after every 20 words so that you don t just generate one very long, unreadable, line. Finally, we d like you to print the table of frequencies in some reasonable fashion so we can check to see if your table is correct on our test input. Here are some lines of output based on the lyrics for Dylan s Blowin in the wind : The table size is 122 <Association: <how,many>=frequency List: <Association: roads=1><association: seas=1> <Association: times=3><association: years=2><association: ears=1><association: deaths=1>> <Association: <many,roads>=frequency List: <Association: must=1>> <Association: <roads,must>=frequency List: <Association: a=1>> <Association: <must,a>=frequency List: <Association: man=2><association: white=1>>... Note that we cheated and manually wrapped the first line of output so that it is readable. Your output need not wrap. Note that most of this output comes from the tostring methods of Association and ArrayList, though we did some extra work to print each entry in the array list on a separate line. 3

This output indicates that after the word pair, how many, the words roads, seas, times, years, ears, and deaths each occurred once. Whereas after many roads, only must occurred and it only occurred once. Warning: As a general rule it is good to be as specific as possible with import statements. Thus, when you import the package with the class Random, use import java.util.random; import java.util.arraylist; import structure5.association; If your program included import java.util.*; along with import structure5.* the program might get confused as to which version of classes it should use as there are several classes in java.util that have the same names as those in structure5. It is best to be safe even when we don t have conflicts. WordStream class This class is already implemented for you. It processes a text file and breaks up the input into separate words that can be requested using the method nexttoken. In the main method of the TextGenerator class, there is an object of type WordStream called ws. You can get successive words from ws by calling ws.nexttoken() repeatedly. Before getting a new word, always check that there is a word available by evaluating ws.hasmoretokens(), which will return true if there are more words available. If you call nexttoken() when there are no more words available (i.e., you ve exhausted the input), then it will throw an IndexOutOfBoundsException. StringPair class This class is already implemented for you. This class provides objects that consist of a pair of strings. These should be used as the keys to your Associations. Pair class This class is already implemented for you. This is a generic class that StringPair extends by specializing the type variables to both be Strings. You should not directly use this class in your program. Instead use the more compact name StringPair from the more specialized class defined above. We provided this class only to show how easy it is to define a generic class that can be instantiated in many ways. Association class This class is part of Bailey s structure5 package. There is one piece of information about the Association class that we did not highlight in class that you will likely find very useful. The equals method in Association only compares key values. That is if two objects from class Association are equal according to the method, then their keys are the same, but their values may be different. Thus, to find out if an association with a given key is in an arraylist (and where it is), just create a new association with that key and use the indexof method to determine its index in the list. If -1 is returned, then it is not there, whereas if a non-negative integer is returned then you get the index of an association with that key in the arraylist. Getting Started 1. Think carefully about the design of this program before sitting down at a computer. What other methods and instance variables might the FreqList class need? For the get method in the FreqList 4

class, take a piece of paper and create different examples of frequency lists. For each example, work out what the get method would return for different values of p (e.g. if the frequency list contained 4 words that each occurred once and p = 0.45, what word would be returned?) 2. Using Finder (or scp if youre working remotely), copy the starter directory /common/cs/cs062/assignments/assignmen directly into your Eclipse workspace. 3. Next in Eclipse, select File Import and target the folder you just copied. 4. Finally, add the BAILEY variable to the projects Java Build Path (go to Project Properties with the project selected and open the Java Build Path tab). 5. Refresh the project in Eclipse. 6. You are now ready to get started! We recommend starting with the FreqList class. Again, try and develop incrementally. That is, get one small piece working and then move on to another piece. Add a main method to the FreqList class in order to test and ensure the correctness of your code. Grading You will be graded based on the following criteria: criterion points FreqList 4 TextGenerator 3 Prints FreqList 2 Prints at least 400 words of generated text 2 Appropriate comments (including JavaDoc) 2 Appropriate use of data structures 2 Style and formatting 2 General correctness 2 submitted correctly 1 extra credit 1 NOTE: Code that does not compile will not be accepted! Make sure that your code compiles before submitting it. Submitting Your Work 1. Test your classes thoroughly before turning in your project. Feel free to use the text files in the assignment folder to test your program. While you may not compare code with other students, you may compare the contents of the tables you generate. 2. Make sure you have added appropriate Javadoc comments to your code. 3. Export your project from Eclipse and submit it using the instructions at http://www.cs.pomona.edu/ classes/cs062/handouts/eclipse_submission.pdf. Please follow these very carefully. If you do not, your program will NOT be submitted correctly and you will not receive credit for your assignment (and you and we will all be very sad!). Don t forget to fill out the asg02.json file! In particular, don t forget to set the ec field to true if you did extra credit. 5

Extra Credit Opportunities There are many ways in which you might extend such a program. For example, as described, word pairs which never appeared in the input will never appear in the output. Is there a way you could introduce a bit of mutation to allow new pairs to appear? These trigrams of words are used in natural language processing in order to categorize kinds of articles and their authors. Think about how you might use tables like those given here to help determine if two articles are written by the same author. In case you do attempt an extra credit extension, don t forget to set the ec field to true in the asg02.json file so that we know you did it. You should also indicate in comments wherever you attempted extra credit. 6