Ling 473 Project 4 Due 11:45pm on Thursday, August 31, 2017

Similar documents
Programming Standards: You must conform to good programming/documentation standards. Some specifics:

CS 3114 Data Structures and Algorithms DRAFT Project 2: BST Generic

1 The Var Shell (vsh)

Access Intermediate

Project #1 Exceptions and Simple System Calls

Assignment 1: Communicating with Programs

CSC209. Software Tools and Systems Programming.

CS3114 (Fall 2013) PROGRAMMING ASSIGNMENT #2 Due Tuesday, October 11:00 PM for 100 points Due Monday, October 11:00 PM for 10 point bonus

Homework Assignment #3

Data Structure and Algorithm Homework #5 Due: 2:00pm, Thursday, May 31, 2012 TA === Homework submission instructions ===

Running Java Programs

Essential Skills for Bioinformatics: Unix/Linux

LING 473: Day 8. START THE RECORDING Computational Complexity Text Processing

Project 1 Balanced binary

Programming Assignment IV Due Thursday, June 1, 2017 at 11:59pm

The print queue was too long. The print queue is always too long shortly before assignments are due. Print your documentation

Data Structure and Algorithm Homework #3 Due: 2:20pm, Tuesday, April 9, 2013 TA === Homework submission instructions ===

Remaining Enhanced Labs

CS 1110, LAB 03: STRINGS; TESTING First Name: Last Name: NetID:

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis

Bioinformatics? Reads, assembly, annotation, comparative genomics and a bit of phylogeny.

ASSIGNMENT 1: TWEET TOKENIZATION & NORMALIZATION

Programming Project 4: COOL Code Generation

Scripting Languages Course 1. Diana Trandabăț

Hands on Assignment 1

CSC209. Software Tools and Systems Programming.

Data Structure and Algorithm Homework #3 Due: 1:20pm, Thursday, May 16, 2017 TA === Homework submission instructions ===

OPERATING SYSTEMS ASSIGNMENT 4 XV6 file system

Final Project: LC-3 Simulator

Data Structure and Algorithm Homework #2 Due: 2:00pm, Thursday, March 29, 2012 TA

CSE 361S Intro to Systems Software Lab #4

Lesson 10A OOP Fundamentals. By John B. Owen All rights reserved 2011, revised 2014

CS ) PROGRAMMING ASSIGNMENT 11:00 PM 11:00 PM

CS 1550 Project 3: File Systems Directories Due: Sunday, July 22, 2012, 11:59pm Completed Due: Sunday, July 29, 2012, 11:59pm

Lab #0 Getting Started Due In Your Lab, August 25, 2004

Programming Assignment 1

15-110: Principles of Computing, Spring 2018

Programming Assignment IV Due Thursday, November 18th, 2010 at 11:59 PM

Implementing Dynamic Minimal-prefix Tries

Part III. Shell Config. Tobias Neckel: Scripting with Bash and Python Compact Max-Planck, February 16-26,

CSE451 Winter 2012 Project #4 Out: February 8, 2012 Due: March 5, 2011 by 11:59 PM (late assignments will lose ½ grade point per day)

CE 221 Data Structures and Algorithms

Lecture 5. Essential skills for bioinformatics: Unix/Linux

Unix background. COMP9021, Session 2, Using the Terminal application, open an x-term window. You type your commands in an x-term window.

CSE Theory of Computing Spring 2018 Project 2-Finite Automata

File Structures and Indexing

DO NOT REPRODUCE. CS61B, Fall 2008 Test #3 (revised) P. N. Hilfinger

$Id: asg4-shell-tree.mm,v :36: $

Genome Browsers - The UCSC Genome Browser

Due: 9 February 2017 at 1159pm (2359, Pacific Standard Time)

Project 1: Implementing a Shell

Vendor Registration and Training

CSE 303 Midterm Exam

Lab 1: Introduction to C, ASCII ART & the Linux Command Line

FIT 100: Fluency with Information Technology

University of Arizona, Department of Computer Science. CSc 453 Assignment 5 Due 23:59, Dec points. Christian Collberg November 19, 2002

CS 4218 Software Testing and Debugging Ack: Tan Shin Hwei for project description formulation

Distributed Memory Programming With MPI Computer Lab Exercises

Using the Unix system. UNIX Introduction

Introduction to Programming Style

ACCUPLACER Placement Validity Study Guide

Creating a Shell or Command Interperter Program CSCI411 Lab

Lab 2: Threads and Processes

COMP 4/6262: Programming UNIX

COMP/ELEC 429/556 Project 2: Reliable File Transfer Protocol

CS 1044 Program 6 Summer I dimension ??????

Project 6: Assembler

An Introduction to Cluster Computing Using Newton

A Guide to Condor. Joe Antognini. October 25, Condor is on Our Network What is an Our Network?

Due Friday, March 20 at 11:59 p.m. Write and submit one Java program, Sequence.java, as described on the next page.

Due: March 8, 11:59pm. Project 1

CS 1110, LAB 3: STRINGS; TESTING

I/O and Shell Scripting

Google Drive: Access and organize your files

Programming Assignment 2 ( 100 Points )

EE 422C HW 6 Multithreaded Programming

4.1. Access the internet and log on to the UCSC Genome Bioinformatics Web Page (Figure 1-

Final Examination CSE 100 UCSD (Practice)

CSE Theory of Computing Fall 2017 Project 2-Finite Automata

CSCI544, Fall 2016: Assignment 1

HHH Instructional Computing Fall

ITNPBD7 Cluster Computing Spring Using Condor

Running SNAP. The SNAP Team February 2012

CSE Theory of Computing Spring 2018 Project 2-Finite Automata

CSCI2467: Systems Programming Concepts

Notes to Instructors Concerning the BLITZ Projects

COMP 3500 Introduction to Operating Systems Project 5 Virtual Memory Manager

CIS 121 Data Structures and Algorithms with Java Spring 2018

Operating Systems (234123) Spring 2017 (Homework Wet 1) Homework 1 Wet

CSE 143: Computer Programming II Spring 2015 HW7: 20 Questions (due Thursday, May 28, :30pm)

15-780: Problem Set #2

Using Moodle activities: Wiki

Exercise Sheet 2. (Classifications of Operating Systems)

UNIVERSITY OF NEBRASKA AT OMAHA Computer Science 4500/8506 Operating Systems Summer 2016 Programming Assignment 1 Introduction The purpose of this

CS/IT 114 Introduction to Java, Part 1 FALL 2016 CLASS 2: SEP. 8TH INSTRUCTOR: JIAYIN WANG

8 MANAGING SHARED FOLDERS & DATA

CS 3204 Operating Systems Programming Project #2 Job / CPU Scheduling Dr. Sallie Henry Spring 2001 Due on February 27, 2001.

Project #3. Computer Science 2334 Fall Create a Graphical User Interface and import/export software for a Hurricane Database.

Running SNAP. The SNAP Team October 2012

Buffer Manager: Project 1 Assignment

Transcription:

Ling 473 Project 4 Due 11:45pm on Thursday, August 31, 2017 Bioinformatics refers the application of statistics and computer science to the management and analysis of data from the biosciences. In common with computational linguistics, the field shares the task of detecting and searching for patterns in gigantic datasets. For this project you search for DNA sequences in the 24 chromosomes of the complete hg19 GRCh37 human genome. This base pairs are represented in text form, so the problem is essentially a text processing problem. There are 4,965 target sequences. For each of these you will identify the chromosome file and the offset within the file for all matches. This problem was selected because this search cannot be performed efficiently without special attention to the algorithm that is used. Using grep around five thousand times to search 2.8 gigabytes probably won t end well, even in this simplified exercise. Instead, you will store the search targets in a prefix trie and make a single pass through the genome corpus. Because a position in the trie keeps track of multiple simultaneous searches, this reduces the problem from O NM to O(N log M). Hashing cannot be used for this purpose, because the target sequences are different lengths, so you don t know how many characters from the input to use for computing a hash. Human Genome Corpus The following directory on patas contains 24 files, corresponding to human chromosomes 1 through 22, plus X and Y. Each file describes an ordered sequence of nucleotide bases A, T, C or G. /opt/dropbox/17-18/473/project4/hg19-grch37/ These files come from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigzips/hg19.2bit. You may copy them to your home machine for local development, but your final code must run on patas. Masking In accordance with convention for DNA data, some of the nucleotides are represented by the corresponding lower-case letter. This indicates that they are masked, meaning that they are of low complexity and thus believed to be less important. For this project, you will do a caseinsensitive search, so masked areas will be included (the target sequences are specified as all uppercase). Unknown There is no data available for some parts of the genome. These areas are indicated by the letter N. None of the target sequences contain N, and your trie does not need to be able to recognize N. Loading the Target Sequences You will develop a trie structure. Each trie node makes reference to (up to) four child nodes, one for each nucleotide. Any or all of these references can be null. When all four children of a node are null, the node is a terminal node. Each node (except the root) has exactly one parent, but nodes do not need to store a reference to their parent node, because navigation through a trie is always one-way, starting from the root. The elegance of a trie is that every node implicitly represents a unique string merely by its position.

Some trie implementations allow each node to store a payload. One purpose of the payload is to signal whether the node represents the last character in a target string. In this case, the payload would be null if the trie node is not a stopping point. This sort of mechanism (for differentiating valid acceptance points) is required if the trie has to store strings that are exact prefixes of other strings that are also stored. For example, { cat catamaran }. In our application, none of the target sequences start with any of the others, so we don t need to use a payload. Instead, we will know if we are at an acceptance point (i.e. that we found a match) if the node has no child nodes. After defining the node structure, you will instantiate a single node to represent the trie root. Next, you will populate the trie with each target sequence. The target sequences are in the following file, one per line: /opt/dropbox/17-18/473/project4/targets Add each target sequence to the trie by starting at the trie root and following the sequence s path of nucleotides. This is done by advancing through the target string and navigating/building the trie at the same time. For example, if you come to a child node that is missing, add it by creating a new node. When you reach the end of the target string, you should be at a node with no children a terminal node that you just created. If your trie nodes store a payload (not required here), you would store the payload in this node in order to mark the node as a completed match. In such a design, matching can occur at non-terminal nodes. The target strings can be discarded now, since they are all stored in the trie implicitly (and explicitly as well, if you chose to store each string as a payload). Searching Now that you have a trie that represents the strings that you are searching for, you can scan the corpus in a single pass, finding all the matches. You will still need to check every character position in the corpus, but the trie allows you to check all targets at the same time and abort as soon as any match from that position becomes impossible. You will need to keep track of two index positions in each genome file i, j and one navigation position a pointer to a node, pn within the trie. The outer index position i will always advance one character at a time. As you read from the corpus, don t forget to normalize the letter-case to whatever you stored in the trie, since we do wish to search within masked data. For each of these outer positions, the navigation position pn will start at the trie root, and a second inner character position j will be initialized to the value of the outer index position i. Now, similar to how the trie was loaded, pn will attempt to navigate through the trie, according to the characters read from the corpus, beginning from j. Of course, this time we don t modify the tree; if further navigation becomes impossible and we are not at a terminal node, there is no match, so increment the overall position i, set j = i and pn = root, and repeat. If you are able to reach a terminal node by matching the corpus to the trie, then you have found a match. If you didn t store the target sequence as a payload in the terminal node, it s easy enough to recover the string that you ve matched: it s the substring of the input from i to j, inclusive. (Note that you don t need to build it as you go.) Record each match as described in the Output section below.

Performance A major challenge will be dealing with the large files in the human genome corpus, since j does have to continually navigate forward and then reset (seek) backwards within a small area. If you are able to load each file into memory, this isn t a problem. Otherwise, you might consider using file seeking for all operations on the corpus, reading just one character at a time. This is a great solution if you can count on the operating system to buffer your read operations. Even better would be to memory-map the file you re reading, an advanced technique which might be available in your programming language or OS. Your program will certainly be long-running. My single-threaded solution, written in C#, executes in about 4 minutes (4:09) on a 3.17GHz machine. If you are using patas/dryas, please submit your job to condor. You can use the condor-exec shortcut for this. If execution times are prohibitive, you can report results for a smaller fraction of the corpus (5 points deduction). Output Format Output your results to the console (stdout). Print each filename on a new line (the full path can be included). All matches found in that file will follow, with each match on a single line that begins with a tab. After the tab character, print the zero-based offset of the match within the file (please use hex numgers). Follow this with a space (or tab) and the sequence that was found. This can be mixed-case, all-caps, or all-lower. Your output should look like this example fragment: /opt/dropbox/16-17/473/project4/hg19-grch37/chr1.dna 0000C341 AAACTAACTGAATGTTAGAACCAACTCCTGATAAGTCTTGAACAAAAG 00022767 ggggctggagactgacttaatcaccaacagccaaaggttttatcaatcatgcttgcataataaagcctc /opt/dropbox/16-17/473/project4/hg19-grch37/chr10.dna 0013F466 GGCCTGAGAGGGGGGCCCAGGCTCTCCCGGAAGACGGCCTGAGCCAGGTCCACGCTCCCCCGGAAGACGGCCTGAGA 0013F4AA GGCCTGAGAGGGGGGCCCAGGCTCTCCCGGAAGACGGCCTGAGCCAGGTCCACGCTCCCCCGGAAGACGGCCTGAGA /opt/dropbox/16-17/473/project4/hg19-grch37/chr11.dna 00021443 ggggctggagactgacttaatcaccaacagccaaaggttttatcaatcatgcttgcataataaagcctc Extra Credit For extra credit, have your program write a separate output file where the results are grouped together by sequence, rather than by file. Name this file extra-credit (note: no file extension!) and write it to the current directory. TCATTCCAAGAAAAAGTCTTAGGAGTGCAGCACTTCAAAATCAGGTAATG 028751CE chr9.dna 040D4F63 chr9.dna 042383BC chr9.dna AAACTAACTGAATGTTAGAACCAACTCCTGATAAGTCTTGAACAAAAG 0000C341 chr1.dna 000165CD chr19.dna TTGGGCTTGGGGTACAGGAGGAGTTGGTGGGGTGCCTGTGGCCACTCCACTGCCCTCTGGGATTAGGAGGAGACAGTGGGGTCAGGACTCA 00CC6419 chr1.dna 00CFC296 chr1.dna cacacacacacatatatatgtagagacagggtttctccttggtgcccagtttggtct 0465C8FF chr1.dna caagataccaattcaagtatggaatttaagcggtgacaagttaatctaaccctattacaaata 0469AF3B chr1.dna

Submission As a courtesy to your fellow students, please use Condor when you test on patas/dryas. However, for this assignment do not include any Condor-related commands or files in your submission. (I will create a universal condor.cmd to execute everyone s assignments.) Include the following files in your submission: compile.sh run.sh output extra-credit readme.{pdf, txt} (source code and binary files) Contains command(s) that compile your program. If you are using python, shell scripts, or any other interpreted language that does not require compiling, then this file will be empty, or contain just the single line: #!/bin/sh The command(s) that run your program. Be sure to include compiled binaries in your submission so that this script will execute without first running compile.sh This is the output of your program, captured from the console by executing the following command: $./run.sh >output Note that your shell script will not create the output file itself, but print to standard out. Extra credit output (optional) Your write-up of the project. Describe your approach, any problems or special features, or anything else you d like me to review. If you could not complete some or all of the project s goals, please explain what you were able to complete. All source code and binary files (jar, a.out, etc., if any) required to run and compile your program Gather together all the required files, making sure that, for example, any PDF or other binary files are transferred from your local machine using a binary transmission format. Then, from within the directory containing your files, issue the following command to package your files for submission. Replace the bracketed portion with your UW NetID. tar -czf [your-uw-netid].tar.gz. Notice that this command packages all files in the current directory; do not include any top-level directories. Upload the file to CollectIt. Grading Correct results 30 Follow submitting instructions 10 Clarity, elegance, and readability of code 25 Program efficiency 20 Write-up 15 Extra Credit +10

Citation Human Genome 19, GRCh37 Genome Reference Consortium Human Reference 37, Primary Assembly (GCA_000001405.1), February 2009. GRCh37 is a haploid assembly, constructed from multiple individuals. The primary assembly represents the assembled chromosomes, plus any unlocalized or unplaced sequence that represent the non-redundant, haploid assembly.