SUPPLEMENTARY INFORMATION


doi: /nature25174

Sequences of DNA primers used in this study.

Primer code  Direction  Target / use                        Primer sequence
498          F          pegfp                               GTCCAGATCTTGATTAAGAAAAATGAAGAAA
499          F          pegfp                               GTCCAGATCTTGGTTAAGAAAAATGAAGAAA
500          R          pegfp                               GTCCCTGCAGCCTAGAGGGTTAGG
495          F          pdluc-stopgo-emcv IRES              GTCCCTCGAGAGCCAGACACAA
494          R          pdluc-stopgo-emcv IRES              GCGTTGCTCGGGCCC
496          F          pdluc-stopgo-emcv IRES              AACCCCGGGCCCGAGCAACGCTCGCCCCAGAAGATTGAA
487          R          pdluc-emcv IRES                     TGAGGCCAACACCTAATGAGGACGAAAGCCTTGT
486          R          pdluc-emcv IRES                     AGATCTTAGAACAGTCCTAGAGGGTTAGGCTGAGGCCAACACCTAATGA
485          F          pdluc-emcv IRES                     GTCCCTCGAGGAAGCAGCAAC
1710         F          pcdna3-ha                           ATAACTCGAGGAAGCAGCAACAACAGCAGAG
1711         R          pcdna3-ha                           TTATAGATCTATGAGGACGAAAGCCTTGTCTGTGG
614          F          FLuc qPCR                           CTGGAGACATAGCTTACTGG
615          R          FLuc qPCR                           GGAAAGACGATGACGGAA
616          F          βActin qPCR                         GCGTGACATTAAGGAGAAG
617          R          βActin qPCR                         AAGGAAGGCTGGAAGAG
444          R          PCR for mRNA synthesis              TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTGTCCTACGAGTTGCATG
445          F          PCR for mRNA synthesis              TCGGCCTCTGAGCTATTC
1513         F          AMD1 variant 1 NheI                 ATAAGCTAGCGAAGCTGCACATTTTTTCGAAGGGACC
1514         R          AMD1 variant 1 XhoI                 TTATCTCGAGTTACTAATGAGGACGAAAGCC
1515         F          AMD1 variant 1 TGA-TGG IFC          AGCAACAACAGCAGAGTTGGTTAAGAAAAATGAAGAAA
1516         R          AMD1 variant 1 TGA-TGG IFC          TTTCTTCATTTTTCTTAACCAACTCTGCTGTTGTTGCT
1517         F          AMD1 variant 1 TGA-TAA Neg Co       AGCAACAACAGCAGAGTTAATTAAGAAAAATGAAGAAA
             R          AMD1 variant 1 TGA-TAA Neg Co       TTTCTTCATTTTTCTTAATTAACTCTGCTGTTGTTGCT
             F          AMD1 variant 1 TGA-TAATAA Neg Co    AGCAACAACAGCAGAGTTAATAATTAAGAAAAATGAAGAAA
             R          AMD1 variant 1 TGA-TAATAA Neg Co2   TTTCTTCATTTTTCTTAATTATTAACTCTGCTGTTGTTGCT
781          FOR        Renilla qPCR                        GGTACTGTTGGTAAAGCCACCATGGCT
782          REV        Renilla qPCR                        CGACGTGCCTCCACAGGTAGC
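The forward and reverse primers of the AMD1 variant 1 mutagenesis pairs listed above appear to be exact reverse complements of one another, as is usual for site-directed mutagenesis primers. The following minimal sketch checks this for two of the pairs; the helper names `reverse_complement` and `check_pairs` are illustrative and are not part of the original supplementary material.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


# Forward/reverse sequences copied from the primer list above.
PRIMER_PAIRS = {
    "TGA-TGG IFC": (
        "AGCAACAACAGCAGAGTTGGTTAAGAAAAATGAAGAAA",
        "TTTCTTCATTTTTCTTAACCAACTCTGCTGTTGTTGCT",
    ),
    "TGA-TAA Neg Co": (
        "AGCAACAACAGCAGAGTTAATTAAGAAAAATGAAGAAA",
        "TTTCTTCATTTTTCTTAATTAACTCTGCTGTTGTTGCT",
    ),
}


def check_pairs(pairs):
    """Map each pair name to True if the reverse primer is the
    reverse complement of the forward primer."""
    return {
        name: reverse_complement(fwd) == rev
        for name, (fwd, rev) in pairs.items()
    }


if __name__ == "__main__":
    for name, ok in check_pairs(PRIMER_PAIRS).items():
        print(name, "complementary" if ok else "MISMATCH")
```

A check of this kind is a quick way to catch transcription errors when primer sequences are copied out of a table.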

Python code for identifying peaks of ribosome density in extended ORFs

#!/usr/bin/env python
# Script to find riboseq peaks between the annotated stop of coding transcripts and the next in frame stop. Any peak overlapping with another annotated CDS
# region will be ignored. Takes in the following arguments path_to_gtf_file, path_to_bed_file, path_to_fasta_file, name_of_output_file

import sys
import shelve
from intervaltree import Interval, IntervalTree

# Discard genes with downstream peaks lower than this value.
MIN_PEAK_COUNT = 500

# Path to gtf file.
gtffile = open(sys.argv[1], "r")
# Path to bedfile with riboseq counts aligned to the genome.
bedfile = open(sys.argv[2], "r")
# Path to genomic fasta file.
fastafile = open(sys.argv[3], "r")
outfile = open(sys.argv[4], "w")
outfile.write("chrom,position,tran,gene,max_peak\n")

# The following four dictionaries hold tuples corresponding to their names for each chromosome.
all_cds = {}
all_utr = {}
all_other = {}
all_genes = {}
# Holds the final peak count for each gene.

master_dict = {}
# tree_dict holds the interval trees for each chromosome. There are 3 trees for each corresponding to cds, utr and 'other' regions.
tree_dict = {}
# gene_dict holds chromosomes as top level keys which point to dictionaries of genes. Each gene points to another dictionary of transcripts which holds information such as readthrough coordinates.
gene_dict = {}
# top level keys are chromosomes in tran_dict, each chromosome has a dictionary of transcript id's which hold information such as CDS length.
tran_dict = {}
# fasta_dict holds the nucleotide sequences for each chromosome.
fasta_dict = {}
# Used to store the positions and counts from the bedfile.
footprint_dict = {}
# Stores the same info as footprint_dict but with each position annotated as either cds, utr or other.
annotated_footprint_dict = {}

# This block reads the fasta file putting each header as a key in fasta_dict with its value as the nucleotide sequence.
infasta = fastafile.read()
fastafile.close()
splitfasta = infasta.split(">")
for item in splitfasta[1:]:
    header = (item.split("\n")[0]).strip(">")
    nucs_list = (item.split("\n")[1:])
    nucs = ("".join(nucs_list)).replace("\n", "")

    fasta_dict[header] = nucs

# Parses gtf file to populate the tran_dict and the 4 dictionaries all_cds, all_utr, all_other, all_genes
for line in gtffile:
    if line[0] != '#':
        splitline = line.split("\t")
        chrom = splitline[0]
        annot_type = splitline[2].lower()
        strand = splitline[6]
        if strand == "+":
            start = int(splitline[3])
            end = int(splitline[4])
        else:
            start = int(splitline[3])+1
            end = int(splitline[4])+1
        strand = splitline[6]
        desc = splitline[8]
        if annot_type == "cds":
            # Transcript id and gene name with any version suffix removed.
            tran = ((desc.split('transcript_id "')[1]).split('"')[0]).split(".")[0]
            gene = ((desc.split('gene_name "')[1]).split('"')[0]).split(".")[0]
            if chrom in tran_dict:
                if tran in tran_dict[chrom]:
                    tran_dict[chrom][tran]["cds"].append((start, end))
                else:
                    tran_dict[chrom][tran] = {"cds": [(start, end)], "utr": [], "strand": strand, "gene": gene}
            else:
                tran_dict[chrom] = {tran: {"cds": [(start, end)], "utr": [], "strand": strand, "gene": gene}}
            if start != end:
                if chrom not in all_genes:
                    all_genes[chrom] = []

                if chrom not in all_cds:
                    all_cds[chrom] = []
                all_genes[chrom].append((gene, strand, start, end))
                all_cds[chrom].append((start-1, end+1))
        elif annot_type == "stop_codon":
            if start != end:
                if strand == "+":
                    end = end+6
                elif strand == "-":
                    start = start-6
                if chrom not in all_genes:
                    all_genes[chrom] = []
                if chrom not in all_cds:
                    all_cds[chrom] = []
                all_genes[chrom].append((gene, strand, start, end))
                all_cds[chrom].append((start-1, end+1))
        elif annot_type == "utr":
            tran = ((desc.split('transcript_id "')[1]).split('"')[0]).split(".")[0]
            gene = ((desc.split('gene_name "')[1]).split('"')[0]).split(".")[0]
            if chrom in tran_dict:
                if tran in tran_dict[chrom]:
                    tran_dict[chrom][tran]["utr"].append((start, end))
                else:
                    tran_dict[chrom][tran] = {"cds": [], "utr": [(start, end)], "strand": strand, "gene": gene}
            else:
                # First record seen on this chromosome is a UTR, so store the interval under "utr".
                tran_dict[chrom] = {tran: {"cds": [], "utr": [(start, end)], "strand": strand, "gene": gene}}
            if start != end:
                try:
                    all_utr[chrom].append((start, end))

                except KeyError:
                    all_utr[chrom] = [(start, end)]
        else:
            if start != end:
                try:
                    all_other[chrom].append((start, end))
                except KeyError:
                    all_other[chrom] = [(start, end)]
gtffile.close()

# For each coding transcript populate gene_dict with information such as cds co-ordinates, strand, readthrough length.
for chrom in tran_dict:
    for transcript in tran_dict[chrom]:
        if tran_dict[chrom][transcript]["cds"] != []:
            total_length = 0
            strand = tran_dict[chrom][transcript]["strand"]
            gene = tran_dict[chrom][transcript]["gene"]
            cds_point = tran_dict[chrom][transcript]["cds"][0][0]
            utrs = tran_dict[chrom][transcript]["utr"]
            for tup in utrs:
                total_length += (tup[1] - tup[0])+1
            for tup in tran_dict[chrom][transcript]["cds"]:
                total_length += (tup[1] - tup[0])+1
            three_trailers = []
            fixed_three_trailers = []
            three_trailer_len = 0
            fixed_cds = []
            cds_len = 0

            if strand == "+":
                for tup in utrs:
                    if tup[0] > cds_point:
                        three_trailers.append(tup)
                sorted_three_trailers = sorted(three_trailers, key=lambda x: x[0])
                # append one to the end of each interval (this is because interval trees are non inclusive)
                for tup in sorted_three_trailers[1:]:
                    fixed_three_trailers.append((tup[0], tup[1]+1))
                    three_trailer_len += abs(tup[1]-tup[0])+1
                if sorted_three_trailers != []:
                    fixed_three_trailers.append((sorted_three_trailers[0][0]+3, sorted_three_trailers[0][1]+1))
                    three_trailer_len += abs(sorted_three_trailers[0][0] - sorted_three_trailers[0][1])+1
                sorted_cds = sorted(tran_dict[chrom][transcript]["cds"], key=lambda x: x[0])
                for tup in sorted_cds[:-1]:
                    cds_len += abs(tup[1]-tup[0])+1
                    fixed_cds.append((tup[0], tup[1]+1))
                if sorted_cds != []:
                    fixed_cds.append((sorted_cds[-1][0], sorted_cds[-1][1]+4))
                    cds_len += abs(sorted_cds[-1][0]-sorted_cds[-1][1])+1
                genomic_cds_start = sorted_cds[0][0]
                genomic_cds_stop = sorted_cds[-1][1]
                cds_stop = total_length-three_trailer_len
            else:
                for tup in utrs:
                    if tup[0] < cds_point:
                        three_trailers.append(tup)
                sorted_three_trailers = sorted(three_trailers, key=lambda x: x[0])
                # append one to the end of each interval (this is because interval trees are non inclusive)
                for tup in sorted_three_trailers[:-1]:

                    fixed_three_trailers.append((tup[0]-2, tup[1]))
                    three_trailer_len += abs(tup[1]-tup[0])+1
                if sorted_three_trailers != []:
                    fixed_three_trailers.append((sorted_three_trailers[-1][0]-2, sorted_three_trailers[-1][1]-3))
                    three_trailer_len += abs(sorted_three_trailers[-1][0]-sorted_three_trailers[-1][1])+1
                sorted_cds = sorted(tran_dict[chrom][transcript]["cds"], key=lambda x: x[0])
                for tup in sorted_cds[1:]:
                    cds_len += abs(tup[1]-tup[0])+1
                    fixed_cds.append((tup[0]-1, tup[1]))
                if sorted_cds != []:
                    fixed_cds.append((sorted_cds[0][0]-4, sorted_cds[0][1]))
                    cds_len += abs(sorted_cds[0][0]-sorted_cds[0][1])+1
                genomic_cds_start = sorted_cds[-1][1]
                genomic_cds_stop = sorted_cds[0][0]
                cds_stop = total_length - three_trailer_len
            if chrom not in gene_dict:
                gene_dict[chrom] = {}
            if gene not in gene_dict[chrom]:
                gene_dict[chrom][gene] = {}
            gene_dict[chrom][gene][transcript] = {"cds": fixed_cds, "3utr": fixed_three_trailers, "strand": strand, "length": total_length, "three_trailer_len": three_trailer_len, "cds_len": cds_len, "cds_stop": cds_stop, "genomic_cds_start": genomic_cds_start, "genomic_cds_stop": genomic_cds_stop}

# Given a nucleotide sequence returns its complement (the caller reverses the sequence first to obtain the reverse complement).
def get_comp_seq(inseq):
    upseq = inseq.upper()
    # Replace each upper-case base with its lower-case complement so that no base is swapped twice.
    lowseq = upseq.replace("A", "t").replace("T", "a").replace("G", "c").replace("C", "g")
    return lowseq.upper()

# Find the readthrough co-ordinates for all transcripts.
all_stops = ["TAG", "TAA", "TGA"]
for chrom in gene_dict:
    if chrom not in fasta_dict:
        print("Skipping chrom {} as it is not present in the fasta file".format(chrom))
        continue
    for gene in gene_dict[chrom]:
        for tran in gene_dict[chrom][gene]:
            readthrough_intron = False
            minusone_intron = False
            plusone_intron = False
            strand = gene_dict[chrom][gene][tran]["strand"]
            three_trailers = gene_dict[chrom][gene][tran]["3utr"]
            three_trailers = sorted(three_trailers, key=lambda x: x[0])
            seq = ""
            for tup in three_trailers:
                tup_seq = fasta_dict[chrom][tup[0]-1:tup[1]-1]
                seq += tup_seq
            if strand == "-":
                seq = get_comp_seq(seq[::-1])
            fixed_seq = seq[3:]

            readthrough_len = 0
            readthrough_coords = []
            # Walk the trailer codon by codon until the first in-frame stop codon.
            for i in range(0, len(fixed_seq), 3):
                codon = fixed_seq[i:i+3]
                readthrough_len += 3
                if codon in all_stops:
                    break
            temp_readthrough_len = readthrough_len
            if strand == "-":
                three_trailers = three_trailers[::-1]
            for tup in three_trailers:
                tup_len = tup[1] - tup[0]
                if temp_readthrough_len > tup_len:
                    # The readthrough region extends past this exon, i.e. it spans an intron.
                    readthrough_intron = True
                    temp_readthrough_len -= tup_len
                    if strand == "+":
                        readthrough_coords.append((tup[0]+3, tup[1]))
                    elif strand == "-":
                        readthrough_coords.append((tup[1], tup[0]-1))
                else:
                    if readthrough_intron == False:
                        if strand == "+":
                            readthrough_coords.append((tup[0], tup[0]+readthrough_len+2))
                        elif strand == "-":
                            readthrough_coords.append(((tup[1]-readthrough_len)-3, tup[1]-4))
                    else:
                        if strand == "+":
                            readthrough_coords.append((tup[0], tup[0]+readthrough_len+2))
                        if strand == "-":
                            readthrough_coords.append(((tup[1]-readthrough_len)-3, tup[1]-1))
                    break

            gene_dict[chrom][gene][tran]["readthrough_coordinates"] = readthrough_coords
            gene_dict[chrom][gene][tran]["readthrough_len"] = readthrough_len

# Create an interval tree for each annotation type (i.e. cds, utr, or other) for every chromosome, store these trees in tree_dict.
# Interval trees are created to allow for rapidly checking if a given riboseq peak overlaps with a CDS region.
for key in all_cds:
    tree_dict[key] = {"cds": IntervalTree([Interval(-1, 0)]), "utr": IntervalTree([Interval(-1, 0)]), "other": IntervalTree([Interval(-1, 0)])}
    tree = IntervalTree.from_tuples(all_cds[key])
    tree_dict[key]["cds"] = tree
for key in all_utr:
    tree = IntervalTree.from_tuples(all_utr[key])
    tree_dict[key]["utr"] = tree
for key in all_other:
    tree = IntervalTree.from_tuples(all_other[key])
    tree_dict[key]["other"] = tree

# Parse the bedfile and put positions and counts in footprint_dict.
for line in bedfile:
    splitline = line.split("\t")
    chrom = splitline[0]
    start = int(splitline[1])
    # Majority of reads will be in integer format but some are in scientific notation which int() will fail to parse.
    try:
        count = int(splitline[3].replace("\n", ""))
    except ValueError:
        count = float(splitline[3].replace("\n", ""))

    if chrom not in footprint_dict:
        footprint_dict[chrom] = []
    footprint_dict[chrom].append((start, count))
bedfile.close()

# For each count in footprint_dict use the interval trees to check if it overlaps with a CDS, UTR, OTHER or INTERGENIC region and give it the corresponding label in annotated_footprint_dict.
for chrom in footprint_dict:
    if chrom not in tree_dict.keys():
        continue
    # Create several lists which reads will be recursively placed in if they do not match the current category.
    footprint_list = footprint_dict[chrom]
    subfootprint_list = []
    subtwofootprint_list = []
    subthreefootprint_list = []
    for tup in footprint_list:
        position, count = tup
        if tree_dict[chrom]["cds"].overlaps(position) == True:
            try:
                annotated_footprint_dict[chrom]["cds"][position] = count
            except KeyError:
                annotated_footprint_dict[chrom] = {"cds": {}, "utr": {}, "other": {}, "intergenic": {}}
                annotated_footprint_dict[chrom]["cds"][position] = count
        else:
            subfootprint_list.append(tup)
    for tup in subfootprint_list:
        position, count = tup
        if tree_dict[chrom]["utr"].overlaps(position) == True:
            annotated_footprint_dict[chrom]["utr"][position] = count

        else:
            subtwofootprint_list.append(tup)
    for tup in subtwofootprint_list:
        position, count = tup
        if tree_dict[chrom]["other"].overlaps(position) == True:
            annotated_footprint_dict[chrom]["other"][position] = count
        else:
            subthreefootprint_list.append(tup)
    for tup in subthreefootprint_list:
        position, count = tup
        annotated_footprint_dict[chrom]["intergenic"][position] = count

# For each gene find the highest riboseq peak between the annotated stop and next inframe stop that does not overlap with another annotated CDS region.
for chrom in gene_dict:
    if chrom not in annotated_footprint_dict.keys():
        print("Chrom {} is not in annotated_footprint_dict, skipping".format(chrom))
        continue
    for gene in gene_dict[chrom]:
        # For cases with multiple transcripts only pick the one with the longest 3' trailer, unless the transcripts have different annotated stop codons.
        accepted_trans = {}
        genomic_stops = []
        for tran in gene_dict[chrom][gene]:
            genomic_cds_stop = gene_dict[chrom][gene][tran]["genomic_cds_stop"]
            three_utr_len = gene_dict[chrom][gene][tran]["three_trailer_len"]
            if genomic_cds_stop in accepted_trans:
                if three_utr_len > accepted_trans[genomic_cds_stop][0]:
                    accepted_trans[genomic_cds_stop] = [three_utr_len, tran]
            else:
                accepted_trans[genomic_cds_stop] = [three_utr_len, tran]
        accepted_tran_list = []

        for key in accepted_trans:
            accepted_tran_list.append(accepted_trans[key][1])
        # For all accepted transcripts find the highest riboseq peak in the readthrough co-ordinates using counts that have been annotated as UTR.
        for tran in accepted_tran_list:
            temp_dict = {}
            if tran not in gene_dict[chrom][gene]:
                print("tran not in gene_dict")
                continue
            if "readthrough_coordinates" not in gene_dict[chrom][gene][tran]:
                print("skipping transcript {} for gene {}".format(tran, gene))
                continue
            readthrough_coords = gene_dict[chrom][gene][tran]["readthrough_coordinates"]
            strand = gene_dict[chrom][gene][tran]["strand"]
            for tup in gene_dict[chrom][gene][tran]["3utr"]:
                for i in range(tup[0], tup[1]):
                    if i in annotated_footprint_dict[chrom]["utr"]:
                        temp_dict[i] = annotated_footprint_dict[chrom]["utr"][i]
            max_rt = 0
            max_rt_pos = 0
            for tup in readthrough_coords:
                for i in range(tup[0], tup[1]):
                    if i in temp_dict:
                        if temp_dict[i] > max_rt:
                            max_rt = temp_dict[i]
                            max_rt_pos = i
            # Correct for 0 based co-ordinates
            max_rt_pos = max_rt_pos+1
            # Add this gene to master_dict, unless it has already been added in which case replace it only if the max peak count for this transcript is higher.
            if gene not in master_dict:

                master_dict[gene] = {"chrom": chrom, "position": max_rt_pos, "gene": gene, "transcript": tran, "max_peak": max_rt}
            if max_rt > master_dict[gene]["max_peak"]:
                master_dict[gene] = {"chrom": chrom, "position": max_rt_pos, "gene": gene, "transcript": tran, "max_peak": max_rt}

# Sort the master_dict from highest to lowest max peak count, if max peak count is greater than MIN_PEAK_COUNT then write it to outfile.
for gene in sorted(master_dict.keys(), key=lambda x: (master_dict[x]["max_peak"]), reverse=True):
    if master_dict[gene]["max_peak"] > MIN_PEAK_COUNT:
        outfile.write("{},{},{},{},{}\n".format(master_dict[gene]["chrom"], master_dict[gene]["position"], master_dict[gene]["transcript"], gene, master_dict[gene]["max_peak"]))
outfile.close()
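The two central steps of the script above, scanning the 3' trailer for the next in-frame stop codon and then taking the highest footprint count inside that extension while skipping positions that overlap another CDS, can be illustrated in isolation. The following is a simplified sketch and not the authors' implementation: the function names `extension_length` and `highest_peak` are hypothetical, coordinates are plain 0-based half-open ranges on a single exon, and a list of excluded ranges stands in for the interval trees. Unlike the script, `extension_length` returns None when no in-frame stop is found.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def extension_length(trailer):
    """Length in nucleotides from the start of `trailer` up to and
    including the first in-frame stop codon, or None if there is none.
    `trailer` is assumed to start directly after the annotated stop."""
    trailer = trailer.upper()
    for i in range(0, len(trailer) - 2, 3):
        if trailer[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


def highest_peak(counts, start, end, excluded=()):
    """Return (position, count) of the highest footprint count in
    [start, end), ignoring positions inside any excluded (lo, hi)
    half-open range. Returns (None, 0) if no position has a count."""
    best_pos, best_count = None, 0
    for pos in range(start, end):
        if any(lo <= pos < hi for lo, hi in excluded):
            continue
        count = counts.get(pos, 0)
        if count > best_count:
            best_pos, best_count = pos, count
    return best_pos, best_count


if __name__ == "__main__":
    # Made-up sequence and counts, for illustration only.
    print(extension_length("GTAAGCTGA"))
    print(highest_peak({10: 5, 12: 900, 14: 1200}, 10, 16, [(14, 15)]))
```

In the made-up sequence, the TAA at offset 1 is out of frame and is therefore skipped, so the scan stops at the TGA that closes the third codon.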

Guide to individual supplementary items

Supplementary Figure 1.
Title: Source Data (Gels).
Description: Original source images of the gels used for making the figures, with weight markers. Cropped parts are indicated.

Supplementary Information
Title: Supplementary Information
Description: Information on (i) oligonucleotides used, (ii) Python code and (iii) a guide to additional supplementary items.

Supplementary Data 1
Title: Genomic alignment of tetrapods from UCSC Genome browser 100 species alignment.
Description: Codon alignment obtained with CodAlignView, positions of AMD1 stop and AMD1 tail stop are annotated (second row).

Supplementary Data 2
Title: Alignment of AMD1 coding region and surrounding areas from 146 vertebrate species.
Description: Synonymous and nonsynonymous substitutions are indicated by blue and red colours, respectively, and gaps are in grey. Ka/Ks ratio and sequence identity (see Methods) are shown at the bottom.

Supplementary Data 3
Title: Human transcripts with ribosome density profiles similar to AMD1
Description: List of GENCODE transcripts containing peaks of ribosome density downstream and in-frame of protein coding regions. For each transcript, information on the chromosome, coordinates, locus, GENCODE ID and the number of footprints is provided in comma-delimited format.

Supplementary Data 4
Title: Vectors and plasmids
Description: Sequences of vectors and plasmids used in this study in fasta format.

Supplementary Data 5.
Title: Genomic sequences of AMD1 coding regions.
Description: Genomic sequence of AMD1 coding regions for 146 vertebrate species used in this study in fasta format. Genbank IDs for the source sequences are provided in the comment line for each sequence.

Supplementary Data 6.
Title: Ribosome profiling datasets used for GWIPS-viz global aggregate tracks
Description: Datasets are listed on separate sheets for each genome, first column indicates the publication in which the datasets are described (first author name followed by the year, full reference can be found in GWIPS-viz), second column provides GEO or SRA IDs for each individual dataset from the corresponding study.
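A comma-delimited peak list such as the one written by the Python script can be loaded with the standard csv module. The sketch below assumes the header line the script itself writes (chrom,position,tran,gene,max_peak); the column layout of Supplementary Data 3 may differ. The function name `load_peaks` and the example rows are hypothetical placeholders, not real data.

```python
import csv
import io


def load_peaks(handle, min_count=500):
    """Read rows with the header chrom,position,tran,gene,max_peak and
    return those whose peak exceeds `min_count`, highest first."""
    rows = []
    for row in csv.DictReader(handle):
        # Counts may be written in scientific notation, so parse as float.
        count = float(row["max_peak"])
        if count > min_count:
            rows.append({
                "chrom": row["chrom"],
                "position": int(row["position"]),
                "transcript": row["tran"],
                "gene": row["gene"],
                "max_peak": count,
            })
    return sorted(rows, key=lambda r: r["max_peak"], reverse=True)


# Placeholder content standing in for a real output file.
EXAMPLE = (
    "chrom,position,tran,gene,max_peak\n"
    "chrA,1200,TRAN_A,GENE_A,750\n"
    "chrB,3400,TRAN_B,GENE_B,1.2e+03\n"
    "chrC,5600,TRAN_C,GENE_C,120\n"
)

peaks = load_peaks(io.StringIO(EXAMPLE))

if __name__ == "__main__":
    for peak in peaks:
        print(peak["gene"], peak["chrom"], peak["position"], peak["max_peak"])
```

For a file on disk, the same function can be called with an open file handle in place of the `io.StringIO` object.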
