Hands on Assignment 1

Similar documents
// class variable that gives the path where my text files are public static final String path = "C:\\java\\sampledir\\PS10"

COMP 321: Introduction to Computer Systems

CSCI 1100L: Topics in Computing Lab Lab 11: Programming with Scratch

CS159 - Assignment 2b

Using the Computer for Essays

This is a combination of a programming assignment and ungraded exercises

Dictionary Wars CSC 190. March 14, Learning Objectives 2. 2 Swiper no swiping!... and pre-introduction 2

Lab 1 Introduction to UNIX and C

CS/IT 114 Introduction to Java, Part 1 FALL 2016 CLASS 3: SEP. 13TH INSTRUCTOR: JIAYIN WANG

: Principles of Imperative Computation. Fall Assignment 5: Interfaces, Backtracking Search, Hash Tables

find starting-directory -name filename -user username

Due: Tuesday 29 November by 11:00pm Worth: 8%

Hash Table and Hashing

Due: March 8, 11:59pm. Project 1

Due: 9 February 2017 at 1159pm (2359, Pacific Standard Time)

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis

CSCI 305 Concepts of Programming Languages. Programming Lab 1

Final Programming Project

Problem Set 4. Problem 4-1. [35 points] Hash Functions and Load

Introduction to Algorithms: Massachusetts Institute of Technology 7 October, 2011 Professors Erik Demaine and Srini Devadas Problem Set 4

CS 142 Style Guide Grading and Details

Laboratory Exercise #0

CSCI-UA /002, Fall 2012 RK Lab: Approximiate document matching Due: Sep 26 11:59PM

UNIVERSITY OF NEBRASKA AT OMAHA Computer Science 4500/8506 Operating Systems Fall Programming Assignment 1 (updated 9/16/2017)

Preprocessor Directives

Part I: Written Problems

Lab 03 - x86-64: atoi

Programming Assignment #4

CSE373 Fall 2013, Second Midterm Examination November 15, 2013

Network Administration/System Administration (NTU CSIE, Spring 2018) Homework #1. Homework #1

CS 241 Data Organization using C

Intro to Programming. Unit 7. What is Programming? What is Programming? Intro to Programming

CMSC 201 Spring 2018 Project 3 Minesweeper

Project #1 Exceptions and Simple System Calls

CpSc 1111 Lab 4 Part a Flow Control, Branching, and Formatting

CS 2110 Summer 2011: Assignment 2 Boggle

Program Assignment 2

CS31 Discussion 1E. Jie(Jay) Wang Week1 Sept. 30

The print queue was too long. The print queue is always too long shortly before assignments are due. Print your documentation

Important Project Dates

Word 97: Increasing Efficiency

Microsoft Windows PowerShell v2 For Administrators

(A) Good C++ Coding Style 1

CS 1110, LAB 10: ASSERTIONS AND WHILE-LOOPS 1. Preliminaries

Performing Matrix Operations on the TI-83/84

CSE 5A Introduction to Programming I (C) Homework 4

Perl Basics. Structure, Style, and Documentation

CS261: HOMEWORK 2 Due 04/13/2012, at 2pm

Homework 8: Matrices Due: 11:59 PM, Oct 30, 2018

proj 3B intro to multi-page layout & interactive pdf

Carnegie Mellon University Department of Computer Science /615 - Database Applications C. Faloutsos & A. Pavlo, Fall 2015

APPM 2360 Lab #2: Facial Recognition

TOPIC 2 INTRODUCTION TO JAVA AND DR JAVA

CS125 : Introduction to Computer Science. Lecture Notes #4 Type Checking, Input/Output, and Programming Style

CS 113: Introduction to

CS 241 Data Organization. August 21, 2018

CMSC 201 Fall 2016 Lab 09 Advanced Debugging

Lab #0 Getting Started Due In Your Lab, August 25, 2004

These bit positions are represented by numerical values, as defined in list.h.

What a lot of folks might call the class but as we ll see later, it s not entirely accurate.

Eat (we provide) link. Eater. Goal: Eater(Self) == Self()

CS61BL: Data Structures & Programming Methodology Summer Project 1: Dots!

Assignment 5: Due Friday April 8 at 6pm Clarification and typo correction, Sunday Apr 3 3:45pm

Programming Project 1

CSCI544, Fall 2016: Assignment 2

Variables and. What Really Happens When You Run hello_world.py

CS 134 Programming Exercise 2:

CMSC 311 Spring 2010 Lab 1: Bit Slinging

Slide Set 5. for ENCM 339 Fall Steve Norman, PhD, PEng. Electrical & Computer Engineering Schulich School of Engineering University of Calgary

Style and Submission Guide

COMP 412, Fall 2018 Lab 1: A Front End for ILOC

Homework: Simple functions and conditionals

printf( Please enter another number: ); scanf( %d, &num2);

Important Project Dates

COMP Assignment 1

Assignment 3, Due October 4

Lecture 17: Hash Tables, Maps, Finish Linked Lists

CS350 : Operating Systems. General Assignment Information

Understanding PowerPoint s Text Capabilities

COM110: Lab 2 Numeric expressions, strings, and some file processing (chapters 3 and 5)

MATLAB Demo. Preliminaries and Getting Started with Matlab

9.2 Linux Essentials Exam Objectives

CS 1301 Fall 2008 Lab 2 Introduction to UNIX

EE 422C HW 6 Multithreaded Programming

Word Processing Basics Using Microsoft Word

An Introduction to Subtyping

CS 2505 Fall 2013 Data Lab: Manipulating Bits Assigned: November 20 Due: Friday December 13, 11:59PM Ends: Friday December 13, 11:59PM

Deadline. Purpose. How to submit. Important notes. CS Homework 9. CS Homework 9 p :59 pm on Friday, April 7, 2017

Project #1 Computer Science 2334 Fall 2008

CSSE 304 Assignment #13 (interpreter milestone #1) Updated for Fall, 2018

Section 2. Sending s

6.170 Laboratory in Software Engineering Java Style Guide. Overview. Descriptive names. Consistent indentation and spacing. Page 1 of 5.

CSC209. Software Tools and Systems Programming.

CSE 143: Computer Programming II Spring 2015 HW7: 20 Questions (due Thursday, May 28, :30pm)

Exercises: Instructions and Advice

CS451 - Assignment 3 Perceptron Learning Algorithm

Welcome to CS 135 (Winter 2018)

Lecture 13 Hash Dictionaries

Note: This is a miniassignment and the grading is automated. If you do not submit it correctly, you will receive at most half credit.

CSE 361S Intro to Systems Software Lab #4

Transcription:

Hands on Assignment 1 CSci 2021-10, Fall 2018. Released Sept 10, 2018. Due Sept 24, 2018 at 11:55 PM Introduction Your task for this assignment is to build a command-line spell-checking program. You may be more familiar with spell checking as a feature inside an editor, but what you will build is a separate program that looks through a text file and reports any misspelled words in the file. Additionally, when a word is misspelled, the program should check whether the word is similar to a correct word, and if so suggest the correction. To know what words are correct, the program will take, as another argument, a file containing a list of the legal English words. We ll call this a dictionary file, though it s just the words and not the definitions. To get you started, we ve written the parts of the code that read the input files and identify individual words. Your job is to write the rest. In particular, any places where you see TODO in a comment are places that you should add code. There is one unusual feature that the program should have, according to a request we are passing on from the marketing department. In a (perhaps misguided) plan to differentiate the spell checker from its competition and show that it is prepared for the Internet age, this will be the only spell checker that supports the @ (at) sign in English words. Specifically the @ sign should be treated as a variation of the lowercase letter a. For instance words like em@il and @tmosphere should be counted as correct, just like the outdated 20th-century spellings email and atmosphere. This is the sort of program that you could write in any general-purpose language, and it doesn t depend on any 2021-specific concepts; in fact it might seem like it could have come from CSci 1933. That s on purpose: we ve doing this project first to give you some practice writing programs in C in our Linux environment, to help you prepare for later projects where you ll have to work with C while also exploring new concepts. We ask that you write your program to store the word-list in memory using a hash table data structure. Specifically, use what is called a separate-chaining hash table, one in which all the values that hash to the same index are stored in a linked list. This will require an efficient hash function to index each word in the dictionary. You will populate the hashtable with words found in the dictionary. Then, you will write a function to check if the words in an arbitrary input file are found in the dictionary. Finally, write a recommender function that suggests the misspelled word s correct spelling. The program will be called spellcheck and have the following usage: 1

spellcheck [-d <debug_flags>] <dictionary_file> <input_file>. The angle brackets aren t part of the syntax; they indicate that while spellcheck and -d are written as is, the arguments described as dictionary_file and input_file need to be replaced by appropriate file names. The square brackets mean that the -d option and its flags argument are optional; but if -d is given, the next argument needs to be an integer representing debugging flags. The program should produce output to the standard output (e.g., using printf) with the following format: <misspelledword1>: <suggestedword1> <misspelledword2>: <suggestedword2> <misspelledword3>?... <misspelledwordn>: <suggestedwordn> In other words, there should be one line per misspelled word. Each incorrect word should appear only once. If your program can find a correct word that is similar to the incorrect word, print it after the incorrect word, separated by a space and a colon. If your program can t find a suggestion, just print a question mark after the incorrect word. The dictionary_file has this specific format: There is one word per line, each followed by a newline. Words consist only of English letters A-Z or a-z, or apostrophes, and are less than 100 letters. Your program should be able to efficiently handle a dictionary on the order of 100,000 lines. Words that are capitalized or all-capital-letters in the dictionary represent words that need to be capitalized or all-caps to be correct. For instance America not america, and USA not Usa or usa. If multiple capitalizations of a word appear in the dictionary, the ones that require more capital letters will come first (e.g., Baker the last name before baker the occupation). The input_file should be a file containing ASCII text; beyond that it doesn t have any particular format. Your program should also be able to deal efficiently with a large input file. For instance checking a 10,000 word input file against a 100,000 line dictionary to find 100 misspelled words should require only a fraction of a second. On the course web site near where you found this handout, you ll be able to download a compressed tar file ha1.tar.gz containing the files for this assignment, such as a sample of a dictionary file and some input text files, and a framework spellcheck.c containing the code we ve already written for you. Copy the tar file to your home directory and then expand it with this command: 2

tar -xzvf ha1.tar.gz Part A - Data Structure Your first step in building this program is to construct a data structure that will be used to represent a dictionary. The dictionary will begin as a text file containing a list of correctly spelled words. Your program should read this text file and insert each word into the hashtable. If you ve taken 1913, 1933, or a similar class, you should already be familiar with how hashtables work. If not, you can look at an old textbook or an web resource like Wikipedia for a review, but remember that such outside sources should only be helping you with concepts: you need to write your own code. The file I/O is already written and taken care of, so you will only be responsible for: 1. Creating the hashtable. 2. Creating the hash function. 3. Inserting each word in the dictionary into the hashtable. Your hashtable should: 1. Have an appropriate number of bins to hold around 100,000 entries. 2. Have a well-written hash function that minimizes collisions. 3. Manage collisions properly when they happen. 4. Insert in O(1) time. 5. Find in O(1) expected time. (The time to find an item will likely depend on whether there are collisions. But the time should be O(c) if there are c collisions, and the average number of collisions should be small.) Avoid having too many collisions in your hashtable. This will slow down the performance of your program and defeats the purpose of using a hashtable. As a rough estimate, if you insert as many entries into the hashtable as it hash buckets, more than half of the buckets should have at least one entry. If only a small fraction or even only one bucket is used, that s a good sign you have a poor hash function. Because the number of words in English is not growing much, it is OK for the number of buckets in your hashtable to be fixed, as long as you choose a large enough fixed value. While avoiding collisions, it is also necessary that your hash function be compatible with the lookup operations you want to do: words can occur capitalized even if they are not capitalized in the dictionary. So think about what this implies for your hash function. 3

Part B - Spell Check The second part of this program is the actual spell-check functionality. Given an input text file, your program will need to read every word in the input file (this part is done for you) and look it up in the dictionary. If the word does not exist in the dictionary, either because it is misspelled, or the dictionary is not comprehensive enough, the word should be printed to the console in the format specified above. Here are some simple rules to follow: 1. If a word doesn t appear in the dictionary, it is spelled wrong. 2. If a word is capitalized in the dictionary, it must be capitalized in the same way to be spelled correctly. To be more specific, any letter that is capitalized in the dictionary must be capitalized. It is OK if even more letters are capitalized, like if part of a document is written IN ALL CAPITAL LETTERS. 3. Otherwise (if the word is lowercase in the dictionary), capitalization does not matter. One specific corner case you will need to add some code for has to do with the ' character, which can represent either a single quote or an apostrophe, depending on where it appears. We want to count an apostrophe as part of a word, since the correct placement of apostrophes is an important part of the spelling of contractions, for instance. However single quotes shouldn t be part of a word. The word recognition code that we have written for you always treats the ' character as part of a word; but if a word appears in single quotes without whitespace like 'this', the word your program should look up in the dictionary is this, not 'this'. It is recommended to test as many edge cases as you can think of. This program should be robust and handle many different type of spelling mistakes. Use the provided dictionary as a reference for your tests. This will be the main list of words used in grading, and it is indicative of the size and format of dictionary your program should support well. Part C - Word Suggestion The final part of any good spell check software is to suggest possible correct words. You are asked to implement a simple version of this feature that outputs one reasonable suggestion. The suggestion must be a word that already exists in the dictionary. We are defining a reasonable suggestion as a new word that differs from the misspelled word by one character. This includes: 1. Having one extra character somewhere in the word. 2. Having one fewer character somewhere in the word. 3. Having one substituted character somewhere in the word. 4

For example, suppose a word in the input text was fumd. This is obviously misspelled and should be caught by the spell check. One reasonable suggestion would be the word fund since there is one substituted character. However, an unreasonable suggestion would be the word stop since it has nothing in common with the original word. Your reasonable suggestion should be printed next to the misspelled word in the output. See the output structure above for more detail about the expected output of your program. In many cases, there may exist more than one reasonable suggestion. Depending on how this portion of your program is written, it may find one replacement before another. You are only required to output one such suggestion. If no suggestions are found, print a question mark (?) after the misspelled word as specified above. Testing The dictionary that will be used most often for grading is provided (the file is called dictionary). Make sure that you have tested your program with this dictionary at least a few times. We also suggest that you create your own, smaller dictionaries to use while developing your hashtable and testing. It will be easier to debug 10 or 100 elements than 100,000. The same applies for the input file. Try to think of corner cases and test these with different inputs. The inputs you will be graded against will be very large (some including tens of thousands of words), so do make sure your structures and algorithms will scale to that size. A mechanism for setting the debug level of printouts is built into the program. Feel free to add your own debug levels to help you in your testing. The debug level is an integer, so you can use different values of the variable to select different extra outputs or other debugging code. Expectations This project must be done individually. It is not a lab activity, and it is meant to test your C programming abilities, not the skill of your friend or the code written by other people and posted online. That being said, feel free to consult the TAs if you need more clarification or guidance on the assignment. You can find a list of the TA office hours on the course website under the Staff tab. We also expect that you begin this assignment early. All of the Hands-on- Assignments in this class are designed to take between 10 and 20 hours to complete. One sure fire way to do poorly on these assignments is to start at the 5

last minute and not have enough time to fully think through a solution or ask any questions that you have. Grading The project will be graded in two ways. 1. Automated grading scripts to test the accuracy of the output and the efficiency of the program. 2. Manual grading by TAs reading your code. The grade breakdown will be: Data structure and populating the dictionary: 30% Correctly identifying misspelled words: 20% Reasonable word suggestion: 30% Code clarity, style, and comments: 20% Note A portion of the grading will be automated, so it is important that your program compiles correctly as submitted, takes its inputs in the format specified above, and produces its output in the format specified above. You will get only partial credit if you have the right concepts, but your program does not do what it is supposed to do. Submitting This assignment is due on Monday, Sept 24 at 11:55 pm. Please review the course syllabus for the late submission policy. If you have any questions about the submission process or deadline, please email the TAs at csci2021f18-010- help@umn.edu Please submit your solution to this lab on Moodle. We ask that you only submit one.c file that contains all of your code. Please name the file: spellcheck_<x500>.c Where <X500> is the part of your UMN email address before the @umn.edu, also the same as your CSE Labs username and the sometimes called an Internet ID or X.500 identifier. For example, if your email address is goldy001@umn.edu, your filename should look like: spellcheck_goldy001.c 6