Miniproject 1. Part 1 Due: 16 February. The coverage problem. Method. Why it is hard. Data. Task1



Miniproject 1, Part 1
Due: 16 February

The coverage problem

Given an assembled transcriptome (RNA) and a reference genome (DNA), what fraction (in bases) of the transcriptome sequences matches annotated genes in the reference genome, and what fraction of the bases in annotated genes matches bases in the transcriptome assembly?

Method

The transcriptome sequences are used as the query in a Blastn search against the reference genome. The result of the Blast search is a tabular file that gives the coordinates where each transcript matches the genome.

Why it is hard

Both the transcriptome assembly and the reference annotation contain overlapping regions. For the transcriptome, this is because multiple isoforms are predicted. For the reference, it is because multiple transcripts are annotated, and because genes overlap (on the same or different strands).

Data

Arabidopsis GFF3 file - Arabidopsis_thaliana.TAIR10.20.gff3
Blast search - t3_attair10_20_dna_toplevel.blastn

    # BLASTN
    # Query: comp189_c1_seq1 len=431 path=[1219:0-430]
    # Database: /group/mgribsko/data/genomics/arabidopsis/ensembl/dna/arabidopsis_thaliana.tair dna.toplevel.fa
    # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
    # 2 hits found
    [two tabular hit rows for comp189_c1_seq1; the field values are missing from this copy]
    # BLASTN
    # Query: comp190_c0_seq1 len=271 path=[249:0-270]
    # Database: /group/mgribsko/data/genomics/arabidopsis/ensembl/dna/arabidopsis_thaliana.tair dna.toplevel.fa
    # 0 hits found

Task 1

For the transcriptome, use the blast matches to identify all of the contiguous regions in the reference genome that are covered by any predicted transcript. The reference genome regions are found from the beginning and ending positions of the matches for the subjects.

Use variables to hold the maximum E-value and the minimum match length to consider. This makes it convenient to try different values.

The result should be a list of regions (chromosome, begin, and end positions), and a list of transcripts that match those regions. The transcript IDs here are the queries in the Blast result.

Produce a summary report that gives a per-chromosome breakdown of
- the number of regions
- the maximum, minimum, and average length of the regions
- the number of matching transcripts
- the unique number of transcripts
- the average number of transcripts per region
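The threshold filtering asked for in Task 1 can be sketched as a small Perl subroutine. This is only an illustration under stated assumptions: the name parse_hit, the sample hit line, and every value in it are made up for the example. Only the twelve-column field order comes from the Fields line of the Blast output.

```perl
use strict;
use warnings;

# Thresholds kept in variables so that different values are easy to try.
my $MAX_EVALUE = 1e-20;
my $MIN_LENGTH = 75;

# parse_hit is a hypothetical helper: it returns a hash reference for a hit
# that passes both thresholds, and undef for comment lines and rejected hits.
sub parse_hit {
    my ( $line, $max_evalue, $min_length ) = @_;
    return if $line =~ /^#/;           # comment lines start with '#'
    my @field = split " ", $line;
    return unless @field >= 12;        # not a complete tabular hit line
    my ( $query, $subject, $align_len, $s_start, $s_end, $evalue ) =
        @field[ 0, 1, 3, 8, 9, 10 ];
    return if $evalue > $max_evalue;
    return if $align_len < $min_length;

    # hits on the minus strand are reported with start > end
    ( $s_start, $s_end ) = ( $s_end, $s_start ) if $s_start > $s_end;
    return { query => $query, subj => $subject, begin => $s_start, end => $s_end };
}

# Made-up hit line (not taken from the real search results).
my $example = join "\t", qw( tx1 1 98.5 200 3 0 1 200 5200 5001 1e-50 350 );
my $hit = parse_hit( $example, $MAX_EVALUE, $MIN_LENGTH );
print "$hit->{query} $hit->{subj} $hit->{begin} $hit->{end}\n";    # tx1 1 5001 5200
```

Passing the thresholds as arguments, rather than reading globals inside the subroutine, makes it easy to rerun the same file with several cutoffs.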

See the results below. These results are new. To get them I used E <= 1e-20 and alignment length >= 75. The minimum length in the report can be less than 75, as it is below, if the subject sequence contains a deletion with respect to the query.

[Results table: a count of blast results identified, followed by one row per chromosome (including Mt and Pt) with the columns seq, region, maxlen, minlen, avelen, trans, unique, avetrans. The numeric values are missing from this copy.]

Solution

    # task 1: read a blast file and report the overlapping regions in the subject. The
    # regions are reported as an array of arrays, with the query IDs present in each
    # region stored as the keys of a hash. this is intended to be used with a blastn
    # search of a transcriptome assembly vs a reference
    #
    # @region = ( [ chromosome, begin_pos, end_pos, id_hash ],...
    #
    # Michael Gribskov 24 February
    use strict;
    use Data::Dumper;

    my $EVALUE_THRESHOLD = 1e-20;
    my $LENGTH_THRESHOLD = 75;

    # read in the blast result and save the information we want as an array of hashes
    my @blast;
    my $nblast = 0;
    while ( my $line = <> ) {
        next if ( $line =~ /^#/ );    # skip comment lines
        my ( $query_id, $subject_id, $identity, $align_len, $mismatch, $gap_open,
             $q_start, $q_end, $s_start, $s_end, $evalue, $bit_score ) = split " ", $line;

        # check the length and E-value thresholds
        next unless ( $evalue <= $EVALUE_THRESHOLD );
        next unless ( $align_len >= $LENGTH_THRESHOLD );
        $nblast++;

        # make sure start < end
        if ( $s_start > $s_end ) {
            ( $s_start, $s_end ) = ( $s_end, $s_start );
        }
        push @blast, { subj => $subject_id, query => $query_id,
                       begin => $s_start, end => $s_end };

        # last if ( $nblast > ... );    # optional cap on the number of hits read
    }
    print "$nblast blast results identified\n\n";

    # now find the overlaps. sort by subject id and starting position
    my @region;
    my $current = [ '', 0, 0, {} ];    # a reference to the most recently found region

    # these variables refer to the indices of the elements of the $current array
    my $seq = 0;
    my $end = 2;
    my $id  = 3;
    foreach my $hit ( sort { $a->{subj} cmp $b->{subj} ||
                             $a->{begin} <=> $b->{begin} } @blast ) {
        if ( $hit->{subj} ne $$current[$seq] || $hit->{begin} > $$current[$end] ) {
            # create a new region if sequence changes or begin is greater than
            # end of current region
            $current = [ $hit->{subj}, $hit->{begin}, $hit->{end}, { $hit->{query} => 0 } ];
            push @region, $current;
        }
        else {
            # extend current region
            if ( $hit->{end} > $$current[$end] ) {
                $$current[$end] = $hit->{end};
            }
            $$current[$id]->{ $hit->{query} } = 0;
        }
    }
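To make the returned data structure concrete, here is a toy run of the same sort-and-extend idea. It is a sketch, not the standard solution: merge_hits is a hypothetical name, and the four hits are invented so that two of them overlap.

```perl
use strict;
use warnings;

# merge_hits is a hypothetical name; it applies the sort-and-extend idea to a
# list of hash references with the keys subj, begin, end and query.
sub merge_hits {
    my @hit = @_;
    my @region;
    my $current;
    foreach my $h ( sort { $a->{subj} cmp $b->{subj} || $a->{begin} <=> $b->{begin} } @hit ) {
        if ( !$current || $h->{subj} ne $current->[0] || $h->{begin} > $current->[2] ) {
            # sequence changed, or the hit starts past the end of the current region
            $current = [ $h->{subj}, $h->{begin}, $h->{end}, {} ];
            push @region, $current;
        }
        elsif ( $h->{end} > $current->[2] ) {
            $current->[2] = $h->{end};    # extend the current region
        }
        $current->[3]{ $h->{query} } = 0;    # record the query ID as a hash key
    }
    return @region;
}

# Made-up hits: two overlap on chromosome 1, one is separate, one is on Mt.
my @region = merge_hits(
    { subj => '1',  begin => 100, end => 300, query => 'tx1' },
    { subj => '1',  begin => 250, end => 400, query => 'tx2' },
    { subj => '1',  begin => 900, end => 950, query => 'tx1' },
    { subj => 'Mt', begin => 10,  end => 90,  query => 'tx3' },
);
foreach my $r (@region) {
    my ( $chrom, $begin, $end, $ids ) = @$r;
    print join( "\t", $chrom, $begin, $end, join( ",", sort keys %$ids ) ), "\n";
}
# prints three tab-separated lines:
# 1   100  400  tx1,tx2
# 1   900  950  tx1
# Mt  10   90   tx3
```

Note that tx1 is counted in two regions, which is why the report distinguishes the number of matching transcripts from the unique number of transcripts.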

Task 2

For the reference genome annotation, identify the contiguous regions covered by transcripts, taking into account possible overlaps on both strands. For this project we will simply use annotated transcript regions, although in real life you might want to work at the exon level.

The result should be a list of regions (chromosome, begin, and end positions) and transcript IDs. The transcript IDs here are those in the GFF file. Because of overlaps, you may have to create some hybrid names.

Produce a summary output that shows the number of transcript regions found in the GFF file, and a per-chromosome breakdown of
- the number of regions
- the maximum, minimum, and average region length
- the total number of transcripts
- the unique number of transcripts
- the average number of transcripts per region

See below. These results have been double checked and are correct for the current data file.

[Results table: a count of regions found for feature transcript, followed by one row per chromosome (including Mt and Pt) with the columns seq, region, maxlen, minlen, avelen, trans, unique, avetrans. The numeric values are missing from this copy.]

Solution

    # task 2: read a GFF3 file and report the overlapping regions. The regions are
    # reported as an array of arrays, with the transcript IDs present in each region
    # stored as the keys of a hash.
    #
    # @region = ( [ chromosome, begin_pos, end_pos, id_hash ],...
    #
    # an example of the input is shown at the end of the file
    #
    # Michael Gribskov 24 February
    use strict;
    use Data::Dumper;

    my $FEATURE = 'transcript';

    # the transcripts in the gff file are not necessarily sorted by begin position,
    # so they must be read in and stored, then sorted. Each sorted transcript is
    # either added to the current region, or used to create a new region.
    my @transcript;
    while ( my $line = <> ) {
        next if ( $line =~ /^#/ );    # skip comment lines
        my ( $seq, $source, $feature, $begin, $end, $score, $strand, $frame, $comment )
            = split " ", $line;
        next unless ( $feature =~ /$FEATURE/i );

        # get the ID from the comment field
        my ( $id ) = $comment =~ /^ID=([^;]+);/;    # matches everything between ID= and ;
        push @transcript, { chromosome => $seq, begin => $begin, end => $end, id => $id };
    }

    # fields of the elements are sequence, begin_pos, end_pos, and id hash
    my @region;
    my $current = [ '', 0, 0, {} ];    # a reference to the most recently found region

    # these variables refer to the indices of the elements of the $current array
    my $seq = 0;
    my $end = 2;
    my $id  = 3;
    foreach my $transcript ( sort { $a->{chromosome} cmp $b->{chromosome} ||
                                    $a->{begin} <=> $b->{begin} } @transcript ) {
        if ( $transcript->{chromosome} ne $$current[$seq] ||
             $transcript->{begin} > $$current[$end] ) {
            # create a new region if sequence changes, or begin is greater than
            # end of the current region
            $current = [ $transcript->{chromosome}, $transcript->{begin},
                         $transcript->{end}, { $transcript->{id} => 0 } ];
            push @region, $current;
        }
        elsif ( $transcript->{end} > $$current[$end] ) {
            # extend current region
            $$current[$end] = $transcript->{end};
        }

        # add the transcript id to the current ID hash
        $$current[$id]->{ $transcript->{id} } = 0;
    }
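GFF3 feature lines have nine tab-separated columns, and the last column is a semicolon-separated list of tag=value attributes. The sketch below shows one way to pull the ID out of such a line by splitting on tabs, which keeps attribute values that contain spaces intact. It is an illustration only: parse_gff_line is a hypothetical helper, the sample line is invented, and unlike the listing above it compares the feature type for exact (case-insensitive) equality.

```perl
use strict;
use warnings;

# parse_gff_line is a hypothetical helper. It returns a hash reference for
# lines whose type column equals $feature, and undef otherwise.
sub parse_gff_line {
    my ( $line, $feature ) = @_;
    return if $line =~ /^#/;    # directives and comments
    chomp $line;
    my @col = split /\t/, $line;
    return unless @col == 9;    # feature lines have nine columns
    my ( $seq, $type, $begin, $end, $strand, $attr ) = @col[ 0, 2, 3, 4, 6, 8 ];
    return unless lc $type eq lc $feature;

    # attributes are tag=value pairs separated by semicolons
    my %tag;
    foreach my $pair ( split /;/, $attr ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        $tag{$key} = $value if defined $value;
    }
    return { chromosome => $seq, begin => $begin, end => $end,
             strand => $strand, id => $tag{ID} };
}

# Illustrative line; the ID and coordinates are invented for this example.
my $line = join( "\t", '1', 'example', 'transcript', 3000, 5000, '.', '+', '.',
    'ID=transcript:TX0001;Parent=gene:G0001;Name=demo transcript' ) . "\n";
my $t = parse_gff_line( $line, 'transcript' );
print "$t->{chromosome} $t->{begin} $t->{end} $t->{id}\n";    # 1 3000 5000 transcript:TX0001
```

Keeping the strand in the returned hash costs nothing and leaves room for strand-aware analyses, even though this project merges overlaps on both strands.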

Task 3

3.1 Write a subroutine, compareregions, that compares the GFF and Blast regions and determines
- for the transcriptome: the number of bases that match and do not match annotated gene regions
- for the annotated genes: the number of bases that match and do not match predicted transcripts

USAGE

    my %stats = compareregions( \@blast_region, \@gff_region );

The %stats hash should have the keys gff_only, blast_only, and both, giving the count of bases that are found only in the GFF annotation, only in the blast search, or in both.

3.2 If you did not write them as subroutines, convert the standalone codes for task 1 and task 2 to subroutines. I will provide some standard versions of the codes after everyone turns in their homework for the first week. You can use your own code, but it must be correct. Look carefully at the task 3 main program code provided below and make sure your subroutines work with the arguments shown AND return the data structure requested. Each function should return an array of regions, where each region is an array of chromosome, begin_pos, end_pos, and a hash of the match IDs with the IDs as keys (see the standard code). An example of using these subroutines is

    my @blast_region = getblastregion( $blast_file );
    my @gff_region = getgffregion( $gff_file );

Expected result

The columns, left to right, are listed below. Only red values are essential.
- chrom - chromosome
- total - total number of bases in the chromosome (calculated as if the chromosome spans positions 1 to max(gff_region_end, blast_region_end); obviously this is not really correct)
- genome - total bases in annotated genome (sum of regions from the GFF file)
- transcript - total bases in transcriptome (sum of regions from the blast file)
- neither - bases in the chromosome but not in either GFF or transcript regions; only easy to calculate for the brute force array method
- genome only - bases only in annotated genome regions
- transcript only - bases only in transcriptome regions
- both - bases in both genome annotation and transcript
- % trans - percent of transcript regions found in annotated transcripts = both / total transcript regions
- % genome - percent of annotated genome found in transcripts = both / total genome regions

[Results table: one row per chromosome (including Mt and Pt) plus a final row "all", with the columns chrom, total, genome, transcript, neither, genome only, transcript only, both, % trans, % genome. The numeric values are missing from this copy.]

Main program
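One way to approach compareregions without the brute force array is to walk the two sorted region lists for each chromosome in parallel and add up the overlaps. The sketch below is not the standard solution. It assumes that the regions within each list do not overlap one another (as the task 1 and task 2 merges guarantee), it sums over all chromosomes rather than reporting per chromosome, and the toy regions are invented.

```perl
use strict;
use warnings;

# A sketch of compareregions: bases in both lists are found by a two-pointer
# sweep; the "only" counts follow by subtraction from the list totals.
sub compareregions {
    my ( $blast_region, $gff_region ) = @_;
    my %stats = ( gff_only => 0, blast_only => 0, both => 0 );

    # group [begin, end] pairs by chromosome
    my ( %blast, %gff );
    push @{ $blast{ $_->[0] } }, [ $_->[1], $_->[2] ] foreach @$blast_region;
    push @{ $gff{ $_->[0] } },   [ $_->[1], $_->[2] ] foreach @$gff_region;

    my %chrom = map { $_ => 1 } keys %blast, keys %gff;
    foreach my $c ( keys %chrom ) {
        my @bl = sort { $a->[0] <=> $b->[0] } @{ $blast{$c} || [] };
        my @gf = sort { $a->[0] <=> $b->[0] } @{ $gff{$c}   || [] };
        my ( $bl_total, $gf_total, $both ) = ( 0, 0, 0 );
        $bl_total += $_->[1] - $_->[0] + 1 foreach @bl;
        $gf_total += $_->[1] - $_->[0] + 1 foreach @gf;

        my ( $i, $j ) = ( 0, 0 );
        while ( $i < @bl && $j < @gf ) {
            my $lo = $bl[$i][0] > $gf[$j][0] ? $bl[$i][0] : $gf[$j][0];
            my $hi = $bl[$i][1] < $gf[$j][1] ? $bl[$i][1] : $gf[$j][1];
            $both += $hi - $lo + 1 if $hi >= $lo;
            # advance whichever region ends first
            if ( $bl[$i][1] < $gf[$j][1] ) { $i++ } else { $j++ }
        }
        $stats{both}       += $both;
        $stats{blast_only} += $bl_total - $both;
        $stats{gff_only}   += $gf_total - $both;
    }
    return %stats;
}

# Made-up regions in the [ chromosome, begin, end, id_hash ] layout.
my @blast_region = ( [ '1', 100, 200, {} ], [ '1', 300, 400, {} ], [ 'Mt', 1, 50, {} ] );
my @gff_region   = ( [ '1', 150, 350, {} ], [ 'Pt', 10, 20, {} ] );
my %stats = compareregions( \@blast_region, \@gff_region );
print "both=$stats{both} blast_only=$stats{blast_only} gff_only=$stats{gff_only}\n";
# both=102 blast_only=150 gff_only=110
```

With these toy numbers the two percentages in the expected table would be both / total transcript = 102 / 252 and both / total genome = 102 / 212. A per-chromosome report would need the counts stored under each chromosome name instead of in one flat hash.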

    # given a blastn search of a transcriptome vs a reference genome, and the
    # annotation of the reference genome as a GFF file, tabulate the level of match
    # between the two.
    #
    # Michael Gribskov 17 Feb
    use strict;
    use Data::Dumper;

    my $FEATURE     = 'transcript';
    my $EVAL_CUTOFF = 1e-20;
    my $LEN_CUTOFF  = 75;

    my $blast_file = $ARGV[0];
    my $gff_file   = $ARGV[1];

    # getgffregion is the subroutine version of task 2
    my @gff_region = getgffregion( $FEATURE, $gff_file );
    my $g_regions  = @gff_region;
    print "$g_regions regions found for feature $FEATURE in $gff_file\n\n";
    report( @gff_region );

    # getblastregion is the subroutine version of task 1
    my @blast_region = getblastregion( $EVAL_CUTOFF, $LEN_CUTOFF, $blast_file );
    my $b_regions    = @blast_region;
    print "\n$b_regions regions found for blast search with e-value <= $EVAL_CUTOFF and len >= $LEN_CUTOFF in $blast_file\n\n";
    report( @blast_region );

    # this is the new code you need to write for task 3
    my %stat = compareregions( \@blast_region, \@gff_region );

    exit 0;

    # report
    #
    # calculate some statistics: per chromosome, longest, average and shortest region,
    # number of regions, average number of transcript isoforms per region
    #
    # usage
    #     report( @region );
    sub report {
        my ( @region ) = @_;

        my ( %len_max, %len_sum, %len_min );
        my ( %count_region, %count_transcript );
        my %unique_trans;

        foreach my $r ( @region ) {
            my ( $chromosome, $begin, $end, $id ) = @{$r};
            foreach my $i ( keys %$id ) {
                $unique_trans{$chromosome}{$i}++;
            }
            $count_region{$chromosome}++;
            $count_transcript{$chromosome} += keys %$id;

            my $len = $end - $begin + 1;
            $len_sum{$chromosome} += $len;
            if ( !$len_max{$chromosome} || $len_max{$chromosome} < $len ) {
                $len_max{$chromosome} = $len;
            }
            if ( !$len_min{$chromosome} || $len_min{$chromosome} > $len ) {
                $len_min{$chromosome} = $len;
            }
        }

        print "seq region maxlen minlen avelen trans unique avetrans\n";
        foreach my $chromosome ( sort keys %count_region ) {
            my $ave_len   = $len_sum{$chromosome} / $count_region{$chromosome};
            my $ave_trans = $count_transcript{$chromosome} / $count_region{$chromosome};
            my $unique    = keys %{ $unique_trans{$chromosome} };
            printf "%2s %6d %6d %6d %8.2f %6d %6d %7.2f\n",
                $chromosome, $count_region{$chromosome}, $len_max{$chromosome},
                $len_min{$chromosome}, $ave_len, $count_transcript{$chromosome},
                $unique, $ave_trans;
        }
        return;
    }
    # end of report
