File Navigation and Text Parsing in Java

Similar documents
File Navigation and Text Parsing in Java

Geographic Information System

CS 3114 Data Structures and Algorithms DRAFT Project 2: BST Generic

Structured Types, Files and Parsing, Indexing

For storage efficiency, longitude and latitude values are often represented in DMS format. For McBryde Hall:

Geographic Information System version 1.0

Geographic Information System

CS 2604 Minor Project 1 DRAFT Fall 2000

CS 1044 Program 6 Summer I dimension ??????

gcc o driver std=c99 -Wall driver.c everynth.c

CS 2604 Minor Project 1 Summer 2000

Each line will contain a string ("even" or "odd"), followed by one or more spaces, followed by a nonnegative integer.

You will provide an implementation for a test driver and for a C function that satisfies the conditions stated in the header comment:

struct _Rational { int64_t Top; // numerator int64_t Bottom; // denominator }; typedef struct _Rational Rational;

Decision Logic: if, if else, switch, Boolean conditions and variables

Pointer Accesses to Memory and Bitwise Manipulation

PR quadtree. public class prquadtree< T extends Compare2D<? super T> > {

gcc o driver std=c99 -Wall driver.c bigmesa.c

Given that much information about two such rectangles, it is possible to determine whether they intersect.

CS 2704 Project 1 Spring 2001

Here is a C function that will print a selected block of bytes from such a memory block, using an array-based view of the necessary logic:

A rectangle in the xy-plane, whose sides are parallel to the coordinate axes can be fully specified by giving the coordinates of two opposite corners:

Pointer Casts and Data Accesses

Programming Standards: You must conform to good programming/documentation standards. Some specifics:

Pointer Accesses to Memory and Bitwise Manipulation

Pointer Accesses to Memory and Bitwise Manipulation

Creating a String Data Type in C

// Initially NULL, points to the dynamically allocated array of bytes. uint8_t *data;

CS3114 (Fall 2013) PROGRAMMING ASSIGNMENT #2 Due Tuesday, October 11:00 PM for 100 points Due Monday, October 11:00 PM for 10 point bonus

CS 2704 Project 3 Spring 2000

Both parts center on the concept of a "mesa", and make use of the following data type:

CS 1044 Program 2 Spring 2002

3. When you process a largest recent earthquake query, you should print out:

The assignment requires solving a matrix access problem using only pointers to access the array elements, and introduces the use of struct data types.

CS 2604 Minor Project 3 Movie Recommender System Fall Braveheart Braveheart. The Patriot

CS 1044 Project 1 Fall 2011

This homework is due by 11:59:59 PM on Thursday, October 12, 2017.

CS 3114 Data Structures and Algorithms DRAFT Minor Project 3: PR Quadtree

Multiple-Key Indexing

CS 1044 Project 2 Spring 2003

CS 2604 Minor Project 3 DRAFT Summer 2000

Programming Assignments

The Program Specification:

CS2304 Spring 2014 Project 3

Geographic Information System

COMP 412, Fall 2018 Lab 1: A Front End for ILOC

CS ) PROGRAMMING ASSIGNMENT 11:00 PM 11:00 PM

For this assignment, you will implement a collection of C functions to support a classic data encoding scheme.

Here is a C function that will print a selected block of bytes from such a memory block, using an array-based view of the necessary logic:

Fundamental Concepts: array of structures, string objects, searching and sorting. Static Inventory Maintenance Program

A Capacity: 10 Usage: 4 Data:

Do not start the test until instructed to do so!

CS 2704 Project 2: Elevator Simulation Fall 1999

CSCI 4210 Operating Systems CSCI 6140 Computer Operating Systems Homework 4 (document version 1.1) Sockets and Server Programming using C

Assignment Dropbox. Overview

This is a combination of a programming assignment and ungraded exercises

Invoice Program with Arrays and Structures

a f b e c d Figure 1 Figure 2 Figure 3

CS 1044 Project 5 Fall 2009

Lab 03 - x86-64: atoi

Project #1: Tracing, System Calls, and Processes

Laboratory Assignment #3. Extending scull, a char pseudo-device

Simple C Dynamic Data Structure

Laboratory Assignment #3. Extending scull, a char pseudo-device. Summary: Objectives: Tasks:

CSCI 4210 Operating Systems CSCI 6140 Computer Operating Systems Project 1 (document version 1.3) Process Simulation Framework

Engineering Manual File Format Specification Version: EM15-P Pipeline Specification U.S. Army Corps of Engineers December 01, 2015

Data Structures and Algorithms

Programming Project 5: NYPD Motor Vehicle Collisions Analysis

CS 1803 Pair Homework 4 Greedy Scheduler (Part I) Due: Wednesday, September 29th, before 6 PM Out of 100 points

Accessing Data in Memory

Hands on Assignment 1

1 The Var Shell (vsh)

CIT 590 Homework 5 HTML Resumes

CS 2505 Fall 2018 Data Lab: Data and Bitwise Operations Assigned: November 1 Due: Friday November 30, 23:59 Ends: Friday November 30, 23:59

CSE 401/M501 18au Midterm Exam 11/2/18. Name ID #

CSSE2002/7023 The University of Queensland

CSE 131 Introduction to Computer Science Fall Exam I

CSCI544, Fall 2016: Assignment 2

CS/IT 114 Introduction to Java, Part 1 FALL 2016 CLASS 3: SEP. 13TH INSTRUCTOR: JIAYIN WANG

CS143 Handout 05 Summer 2011 June 22, 2011 Programming Project 1: Lexical Analysis

CS 2505 Computer Organization I Test 1

CMSC 201 Fall 2016 Lab 09 Advanced Debugging

UNIVERSITY OF NEBRASKA AT OMAHA Computer Science 4500/8506 Operating Systems Summer 2016 Programming Assignment 1 Introduction The purpose of this

Address Management User Guide. PowerSchool 8.x Student Information System

Review of Fundamentals. Todd Kelley CST8207 Todd Kelley 1

CMSC 412 Project 1: Keyboard and Screen Drivers

Address Management User Guide. PowerSchool 6.0 Student Information System

CS 2505 Computer Organization I

CSE Theory of Computing Spring 2018 Project 2-Finite Automata

Fundamentals of Programming Session 25

Documentation Requirements Computer Science 2334 Spring 2016

CS 470 Operating Systems Spring 2013 Shell Project

The following steps will create an Eclipse project containing source code for Problem Set 1:

CS 2505 Computer Organization I Test 1

Starting to Program in C++ (Basics & I/O)

CS 2505 Computer Organization I Test 1. Do not start the test until instructed to do so!

CSE Theory of Computing Spring 2018 Project 2-Finite Automata

Protein Sequence Database

UNIVERSITY OF NEBRASKA AT OMAHA Computer Science 4500/8506 Operating Systems Fall Programming Assignment 1 (updated 9/16/2017)

In this lab we will practice creating, throwing and handling exceptions.

Transcription:

File Navigation and Text Parsing in Java This assignment involves implementing a smallish Java program that performs some basic file parsing and navigation tasks, and parsing of character strings. The program will deal with two input files. The first input file, whose name will be supplied from the command line, contains a collection of data records pertaining to geographical features, obtained from the website for the USGS Board on Geographic Names (geonames.usgs.gov). The file begins with a descriptive header line, followed by a sequence of GIS records, one per line, which contain the following fields in the indicated order: Name Type Length/ Decimals Feature ID Integer 10 Feature Name Feature Class String 120 Figure 1: Geographic Data Record Format Short Description Permanent, unique feature record identifier and official feature name String 50 See Figure 3 later in this specification State Alpha String 2 State Numeric County Name County Numeric Elevation (meters) Elevation (feet) String 2 String 100 String 3 String 7 String 8 Real Number 11/7 Real Number 12/7 String 7 String 8 Real Number 11/7 Real Number 12/7 The unique two letter alphabetic code and the unique two number code for a US State The name and unique three number code for a county or county equivalent The official feature location -degrees/minutes/seconds -decimal degrees. Note: Records showing "Unknown" and zeros for the latitude and longitude and decimal fields, respectively, indicate that the coordinates of the feature are unknown. They are recorded in the database as zeros to satisfy the format requirements of a numerical data type. They are not errors and do not reference the actual geographic coordinates at 0 latitude, 0 longitude. coordinates of linear feature only (Class = Stream, Valley, Arroyo) -degrees/minutes/seconds -decimal degrees. Note: Records showing "Unknown" and zeros for the latitude and longitude and decimal fields, respectively, indicate that the coordinates of the feature are unknown. They are recorded in the database as zeros to satisfy the format requirements of a numerical data type. They are not errors and do not reference the actual geographic coordinates at 0 latitude, 0 longitude. Integer 5 Elevation in meters above (-below) sea level of the surface at the primary coordinates Integer 6 Elevation in feet above (-below) sea level of the surface at the primary coordinates Version 5.00 This is a purely individual assignment! 1

Map Name String 100 Name of USGS base series topographic map containing the primary coordinates. Date Created String Date Edited String Notes: The date the feature was initially committed to the database. The date any attribute of an existing feature was last edited. See https:geonames.usgs.gov/domestic/states_fileformat.htm for the full field descriptions. The type specifications used here have been modified from the source (URL above) to better reflect the realities of your programming environment. and longitude may be expressed in (degrees/minutes/seconds, 0820830W) format, or (real number, -82.1417975) format. In format, latitude will always be expressed using 6 digits followed by a single character specifying the hemisphere, and longitude will always be expressed using 7 digits followed by a hemisphere designator. Although some fields are mandatory, some may be omitted altogether. Best practice is to treat every field as if it may be left unspecified. Certain fields are necessary in order to index a record: the feature name and the primary latitude and primary longitude. If a record omits any of those fields, you may discard the record, or index it as far as possible. In the GIS record file, each record will occur on a single line, and the fields will be separated by pipe (' ') symbols. Empty fields will be indicated by a pair of pipe symbols with no characters between them. See the posted VA_Monterey.txt file for many examples. GIS record files are guaranteed to conform to the syntax described above, so there is no explicit requirement that you validate the files. On the other hand, some error-checking during parsing may help you detect errors in your parsing logic. A file can be thought of as a sequence of bytes, each at a unique offset from the beginning of the file, just like the cells of an array. So, each GIS record begins at a unique offset from the beginning of the file. Line Termination Each line of a text file ends with a particular marker (known as the line terminator). In MS-DOS/Windows file systems, the line terminator is a sequence of two ASCII characters (CR + LF, 0X0D0A). In Unix systems, the line terminator is a single ASCII character (LF). Other systems may use other line termination conventions. Why should you care? Which line termination is used has an effect on the file offsets for all but the first record in the data file. As long as we re all testing with files that use the same line termination, we should all get the same file offsets. But if you change the file format (of the posted data files) to use different line termination, you will get different file offsets than are shown in the posted log files. Most good text editors will tell you what line termination is used in an opened file, and also let you change the line termination scheme. All that being said, as project is auto-graded, all input files will have the same line termination, and the grading of correctness will depend on whether you report the correct search results, not on the file offsets you report. Version 5.00 This is a purely individual assignment! 2

Figure 2: Sample Geographic Data Records Note that some record fields are optional, and that when there is no given value for a field, there are still delimiter symbols for it. Also, some of the lines are "wrapped" to fit into the text box; lines are never "wrapped" in the actual data files. FEATURE_ID FEATURE_NAME FEATURE_CLASS STATE_ALPHA STATE_NUMERIC COUNTY_NAME COUNTY_NUMERIC PRIMARY_LAT_ PRIM_LONG_ PRIM_LAT_ PRIM_LON G_ SOURCE_LAT_ SOURCE_LONG_ SOURCE_LAT_ SOURCE_LONG_ ELEV_IN_M ELEV_IN_FT MAP_NAME DATE_CREATED DATE_EDITED 1479116 Monterey Elementary School School VA 51 Roanoke (city) 770 371906N 0795608W 37.3183753-79.9355857 323 1060 Roanoke 09/28/1979 09/15/2010 1481345 Asbury Church Church VA 51 Highland 091 382607N 0793312W 38.4353981-79.5533807 818 2684 Monterey 09/28/1979 1481852 Blue Grass Populated Place VA 51 Highland 091 383000N 0793259W 38.5001188-79.5497702 777 2549 Monterey 09/28/1979 1481878 Bluegrass Valley Valley VA 51 Highland 091 382953N 0793222W 38.4981745-79.539492 382601N 0793800W 38.4337309-79.6333833 759 2490 Monterey 09/28/1979 1482110 Buck Hill Summit VA 51 Highland 091 381902N 0793358W 38.3173452-79.5661577 1003 3291 Monterey SE 09/28/1979 1482176 Burners Run Stream VA 51 Highland 091 382509N 0793409W 38.4192873-79.5692144 382531N 0793538W 38.4252778-79.5938889 848 2782 Monterey 09/28/1979 1482324 Mount Carlyle Summit VA 51 Highland 091 381556N 0793353W 38.2656799-79.5647682 698 2290 Monterey SE 09/28/1979 1482434 Central Church Church VA 51 Highland 091 382953N 0793323W 38.4981744-79.5564371 773 2536 Monterey 09/28/1979 1482557 Claylick Hollow Valley VA 51 Highland 091 381613N 0793238W 38.2704021-79.5439343 381733N 0793324W 38.2925-79.5566667 573 1880 Monterey SE 09/28/1979 1482785 Crab Run Stream VA 51 Highland 091 381707N 0793144W 38.2854018-79.528934 381903N 0793415W 38.3175-79.5708333 579 1900 Monterey SE 09/28/1979 1482950 Davis Run Stream VA 51 Highland 091 381824N 0793053W 38.3067903-79.5147671 382057N 0793505W 38.3491667-79.5847222 601 1972 Monterey SE 09/28/1979 1483281 Elk Run Stream VA 51 Highland 091 382936N 0793153W 38.4934524-79.5314362 383121N 0793056W 38.5226185-79.5156027 757 2484 Monterey 09/28/1979 1483492 Forks of Waters Locale VA 51 Highland 091 382856N 0793031W 38.4823417-79.5086575 705 2313 Monterey 09/28/1979 1483527 Frank Run Stream VA 51 Highland 091 382953N 0793310W 38.4981744-79.5528258 383304N 0793341W 38.5512285-79.5614381 780 2559 Monterey 09/28/1979 1483647 Ginseng Mountain Summit VA 51 Highland 091 382850N 0793139W 38.480675-79.527547 978 3209 Monterey 09/28/1979 1483860 Gulf Mountain Summit VA 51 Highland 091 382940N 0793103W 38.4945636-79.5175468 1006 3300 Monterey 09/28/1979 1483916 Hamilton Chapel Church VA 51 Highland 091 381740N 0793707W 38.2945677-79.6186591 823 2700 Monterey SE 09/28/1979 1484097 Highland High School School VA 51 Highland 091 382426N 0793444W 38.4071387-79.5789333 879 2884 Monterey 09/28/1979 09/15/2010 1484099 Highland Wildlife Management Area Park VA 51 Highland 091 381905N 0793439W 38.3181785-79.577547 954 3130 Monterey SE 09/28/1979... Version 5.00 This is a purely individual assignment! 3

Your program may be invoked from the command line, in either of the following ways: java GISParser -offsets <GIS record file name> java GISParser <GIS record file name> <commands file name> Invoked in the first way, your program will parse the given GIS record file and write, to a file named offsets.txt, the file offset and Feature Name field for each of the records found in the file, listed in the order the records occur in the input file. Invoked in the second way, your program will process the given commands file (and NOT write out a list file offsets and Feature Name values), and write its output to a file named results.txt. Each command must be echoed into the output file, on a line by itself, numbered as shown in the posted log files. Following each echoed command, your program will write one line reporting the results of carrying out that command, as described below. The first input file will be a GIS record file, as described above. The second input file, whose name will also be supplied on the command line, contains a sequence of search commands that must be processed. The only types of search that must be supported are: show_name<tab><offset> If the offset is valid (see below), write the Feature Name field for the record that occurs at that offset show_latitude<tab><offset> show_longitude<tab><offset> If the offset is valid (see below), write the primary latitude or primary longitude, as specified, for the record that occurs at that offset. The specified fields should be separated by whitespace (your choice as to what). If the specified field is not included in the record, write "Coordinate is not given". The and fields are given in the GIS records in two formats, and. You should parse the version. Your output must reformat the latitude or longitude in a human-friendly manner. Specifically: 1090224W 109d 2m 24s West Note that latitude values will always consist of 6 decimal digits, followed by a hemisphere designator ('W' or 'E'), and longitude values will always consist of 7 decimal digits, followed by a hemisphere designator ('S' or 'N'). The Java String and Scanner classes both have useful methods for breaking out the relevant parts. show_elevation<tab><offset> If the offset is valid (see below), write the elevation in feet field for the record that occurs at that offset. One complication is that the elevation fields are optional; if the elevation is not given, write "Elevation is not given". What about invalid offsets? If the offset is not positive, write the error message Offset is not positive. If the offset is larger than the length of the data file, write the error message Offset is too large. If the offset is non-negative but does not correspond to the first character on a line of the file that contains a GIS record, write the error message Offset is unaligned. The only other command is: quit<tab> Cease processing the commands file, log the message Exiting, close all files and exit the program. Each command will occur on a line by itself. Lines beginning with a semi-colon character ';' are comments and should be ignored. Blank lines are possible. The command file is guaranteed to conform to this specification, so you do not need to worry about error-checking when reading it. See the posted command files for examples. Version 5.00 This is a purely individual assignment! 4

Sample input and output files will be supplied on the course website. Under no circumstances may your program store more than one complete GIS record in memory at any given instant, nor is your program allowed to store a collection of file offsets for the records in the file. Note the instructions given above imply your main class (the one that implements public static void main()) must be named GISParser. If you execute your program from within Eclipse, you can still specify command-line parameters; see the Eclipse for 3114 notes for an example. Testing Sample GIS and command files will be supplied on the course website, along with the corresponding correct results. You must test your solution with each of those samples. There is no guarantee these will cover all the logical cases. Beyond that, it is your responsibility to design and conduct thorough and sensible tests of your implementation before submitting it. For that purpose, you may share test input and output files (but absolutely no solution code!!) via the class Forum. You should not use the Curator for your own testing and debugging purposes. The Curator will only accept a limited number of submissions by each student, so use them wisely for locking in your grades. Solutions to this assignment will be graded on CentOS 7, as installed on the rlogin cluster. The evaluation will be performed using command-line tools only. The Resources page on the course website includes instructions for using command-line tools to compile and execute Java programs. You should consult those, as well as the course staff when you have questions. You may develop your solution on Windows or Linux, as you like, and you may use Eclipse or any other IDE you like. But, it is your responsibility to ensure that you have tested your code with the posted testing/grading code on CentOS 7 using Java 8. Failure to do that is likely to result in a score of zero on this assignment. Note this assignment will be autograded, using the supplied testing code, so the comparison tool will complain unless spelling, punctuation and capitalization are exactly as specified. Grading Code The website has a link to a zip file containing tools you can use to test and grade the correctness of your solution. In fact, these are the same tools that we will use to do the grading of your solution (instead of the Curator s built-in tools). gradej1.sh readme.txt tools.zip GISGenerator.jar NM_EddyKnown.txt NM_EddyUnknown.txt LogComparator.jar bash shell script that runs the grading tools explains how this all works zip file containing: Java test data generator tool input files used by the generator tool comparison tool that scores your results Since this is the exact code that will be used when we grade your solution, it is imperative that you make good use of this code in your own testing. See the readme.txt file for detailed instructions. Documentation You should document your implementation in accordance with the Programming Standards page on the course website. It is possible that your implementation will be evaluated for documentation, as well as for correctness of results. If so, your submission will be evaluated by one of the TAs, who will assess a deduction (ideally zero) against your score from the grading code. Note that the evaluation of your project may depend substantially on the quality of your code and documentation. Version 5.00 This is a purely individual assignment! 5

Design Considerations You should apply good object-oriented design principles in your project. Think through object responsibilities and interactions, and sketch out your design before you start coding. The most common design shortcomings with an assignment like this are to identify a too-small set of candidate classes, or to adopt a minimal design in order to reduce coding time. As inspiration, I will tell you that my solution incorporates 8 distinct classes and 2 enumerated types, all of which play important roles within the requirements of the assignment, and also in my solution for a future assignment. Keep in mind that later projects in this course may build on this one. For example, it is likely that in a later project some other part of the GIS database system will need to actually do something with various fields of the GIS records that are retrieved in this assignment. You should consider what possible errors might be encountered, and which ones you should check for. One common example is the validation of command-line parameters. We will not test your solution with invalid parameters, but in reality it's fairly common for a user to supply incorrect, or no, parameters. It's simple to test for the existence of files, and log an error message and exit cleanly if expected files do not exist. As mentioned earlier, a GIS record may not specify a value for every field. In other cases, but not this assignment, logically invalid values may be supplied; for example, a character string (e.g., "Fred") where a numeric value is expected. A program that crashes, or even just computes nonsensical results, in such a situation looks very unprofessional. Finally, you should be careful about designing your solution to catch exceptions and write useful diagnostics to standard output if an exception is caught, especially if it cannot be handled internally. What to Submit For this assignment, you must submit a zip file containing all the Java source code files for your implementation (i.e.,.java files). Submit only the Java source files. Do not submit Java bytecode (class) files. If you use packages in your implementation (and that's good practice), your zip file must include the correct directory structure for those packages. That's easy to verify by running the supplied grading code on the zip file you are planning to submit. This assignment will be auto-graded under CentOS (which should not matter at all) using Java version 1.8u20 or later, and the posted testing/grading code. Your submitted zip file will be placed in the appropriate subdirectory with the packaged test code, and will then be evaluated by running the supplied testing script. We will make no accommodations for submissions that do not work with that script. Warning: the requirement here is for a zip file, based on the fact that there is a standard utility for creating zip files on Linux systems. See "man zip" and "man unzip" for details. We will not accept files in any other format (e.g., tar files, 7-zip'd files, gzip'd files, jar files, rar files, ). Such submissions will NOT work with the supplied script, and will be discarded when we run the grading code ourselves. Instructions, and the appropriate link, for submitting to the Curator are given in the Student Guide at the Curator website: http:www.cs.vt.edu/curator/. You will be allowed to submit your solution multiple times, in order to make corrections. Your score will be determined by testing your last submission. The Curator will not be used to grade your submissions in real time, but you will have already done the grading using the posted code. Version 5.00 This is a purely individual assignment! 6

Pledge Each of your program submissions must be pledged to conform to the Honor Code requirements for this course. Specifically, you must include the following pledge statement at the beginning of the file that contains main(): On my honor: - I have not discussed the Java language code in my program with anyone other than my instructor or the teaching assistants assigned to this course. - I have not used Java language code obtained from another student, or any other unauthorized source, including the Internet, either modified or unmodified. - If any Java language code or documentation used in my program was obtained from another source, such as a text book or course notes, that has been clearly noted with a proper citation in the comments of my program. - I have not designed this program in such a way as to defeat or interfere with the normal operation of the supplied grading code. <Student's Name> <Student's VT email PID> Change Log We reserve the option of assigning a score of zero to any submission that is undocumented or does not contain this statement. Version Date Change(s) 5.00 Jan 15 Base document Version 5.00 This is a purely individual assignment! 7