SUPPLEMENTARY INFORMATION
doi: /nature25174

Sequences of DNA primers used in this study.

Primer code | Primer sequence | Direction | Target
498 | GTCCAGATCTTGATTAAGAAAAATGAAGAAA | F | pegfp
499 | GTCCAGATCTTGGTTAAGAAAAATGAAGAAA | F | pegfp
500 | GTCCCTGCAGCCTAGAGGGTTAGG | R | pegfp
495 | GTCCCTCGAGAGCCAGACACAA | F | pdluc-stopgo-emcv IRES
494 | GCGTTGCTCGGGCCC | R | pdluc-stopgo-emcv IRES
496 | AACCCCGGGCCCGAGCAACGCTCGCCCCAGAAGATTGAA | F | pdluc-stopgo-emcv IRES
487 | TGAGGCCAACACCTAATGAGGACGAAAGCCTTGT | R | pdluc-emcv IRES
486 | AGATCTTAGAACAGTCCTAGAGGGTTAGGCTGAGGCCAACACCTAATGA | R | pdluc-emcv IRES
485 | GTCCCTCGAGGAAGCAGCAAC | F | pdluc-emcv IRES
1710 | ATAACTCGAGGAAGCAGCAACAACAGCAGAG | F | pcdna3-ha
1711 | TTATAGATCTATGAGGACGAAAGCCTTGTCTGTGG | R | pcdna3-ha
614 | CTGGAGACATAGCTTACTGG | F | FLuc qPCR
615 | GGAAAGACGATGACGGAA | R | FLuc qPCR
616 | GCGTGACATTAAGGAGAAG | F | βActin qPCR
617 | AAGGAAGGCTGGAAGAG | R | βActin qPCR
444 | TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTGTCCTACGAGTTGCATG | R | PCR for mRNA synthesis
445 | TCGGCCTCTGAGCTATTC | F | PCR for mRNA synthesis
1513 | ATAAGCTAGCGAAGCTGCACATTTTTTCGAAGGGACC | F | AMD1 variant 1 NheI
1514 | TTATCTCGAGTTACTAATGAGGACGAAAGCC | R | AMD1 variant 1 XhoI
1515 | AGCAACAACAGCAGAGTTGGTTAAGAAAAATGAAGAAA | F | AMD1 variant 1 TGA-TGG IFC
1516 | TTTCTTCATTTTTCTTAACCAACTCTGCTGTTGTTGCT | R | AMD1 variant 1 TGA-TGG IFC
1517 | AGCAACAACAGCAGAGTTAATTAAGAAAAATGAAGAAA | F | AMD1 variant 1 TGA-TAA Neg Co
 | TTTCTTCATTTTTCTTAATTAACTCTGCTGTTGTTGCT | R | AMD1 variant 1 TGA-TAA Neg Co
 | AGCAACAACAGCAGAGTTAATAATTAAGAAAAATGAAGAAA | F | AMD1 variant 1 TGA-TAATAA Neg Co
 | TTTCTTCATTTTTCTTAATTATTAACTCTGCTGTTGTTGCT | R | AMD1 variant 1 TGA-TAATAA Neg Co2
781 | GGTACTGTTGGTAAAGCCACCATGGCT | FOR | Renilla qPCR
782 | CGACGTGCCTCCACAGGTAGC | REV | Renilla qPCR
Python code for identifying peaks of ribosome density in extended ORFs

#!/usr/bin/env python
# Script to find riboseq peaks between the annotated stop of coding transcripts and the next in-frame stop.
# Any peak overlapping with another annotated CDS region will be ignored.
# Takes the following arguments: path_to_gtf_file, path_to_bed_file, path_to_fasta_file, name_of_output_file

import sys
import shelve
from intervaltree import Interval, IntervalTree

# Discard genes with downstream peaks lower than this value.
MIN_PEAK_COUNT = 500

# Path to gtf file.
gtffile = open(sys.argv[1], "r")
# Path to bedfile with riboseq counts aligned to the genome.
bedfile = open(sys.argv[2], "r")
# Path to genomic fasta file.
fastafile = open(sys.argv[3], "r")
outfile = open(sys.argv[4], "w")
outfile.write("chrom,position,tran,gene,max_peak\n")

# The following four dictionaries hold tuples corresponding to their names for each chromosome.
all_cds = {}
all_utr = {}
all_other = {}
all_genes = {}
# Holds the final peak count for each gene.
master_dict = {}
# tree_dict holds the interval trees for each chromosome. There are 3 trees for each,
# corresponding to cds, utr and 'other' regions.
tree_dict = {}
# gene_dict holds chromosomes as top level keys which point to dictionaries of genes. Each gene points to
# another dictionary of transcripts which holds information such as readthrough coordinates.
gene_dict = {}
# Top level keys in tran_dict are chromosomes; each chromosome has a dictionary of transcript ids
# which hold information such as CDS length.
tran_dict = {}
# fasta_dict holds the nucleotide sequences for each chromosome.
fasta_dict = {}
# Used to store the positions and counts from the bedfile.
footprint_dict = {}
# Stores the same info as footprint_dict but with each position annotated as either cds, utr or other.
annotated_footprint_dict = {}

# This block reads the fasta file, putting each header as a key in fasta_dict with its value as the nucleotide sequence.
infasta = fastafile.read()
fastafile.close()
splitfasta = infasta.split(">")
for item in splitfasta[1:]:
    header = (item.split("\n")[0]).strip(">")
    nucs_list = (item.split("\n")[1:])
    nucs = ("".join(nucs_list)).replace("\n", "")
    fasta_dict[header] = nucs

# Parses gtf file to populate the tran_dict and the 4 dictionaries all_cds, all_utr, all_other, all_genes
for line in gtffile:
    if line[0] != '#':
        splitline = line.split("\t")
        chrom = splitline[0]
        annot_type = splitline[2].lower()
        strand = splitline[6]
        if strand == "+":
            start = int(splitline[3])
            end = int(splitline[4])
        else:
            start = int(splitline[3])+1
            end = int(splitline[4])+1
        desc = splitline[8]
        if annot_type == "cds":
            tran = ((desc.split('transcript_id "')[1]).split('"')[0]).split(".")[0]
            gene = ((desc.split('gene_name "')[1]).split('"')[0]).split(".")[0]
            if chrom in tran_dict:
                if tran in tran_dict[chrom]:
                    tran_dict[chrom][tran]["CDS"].append((start, end))
                else:
                    tran_dict[chrom][tran] = {"CDS": [(start, end)], "UTR": [], "STRAND": strand, "GENE": gene}
            else:
                tran_dict[chrom] = {tran: {"CDS": [(start, end)], "UTR": [], "STRAND": strand, "GENE": gene}}
            if start != end:
                if chrom not in all_genes:
                    all_genes[chrom] = []
                if chrom not in all_cds:
                    all_cds[chrom] = []
                all_genes[chrom].append((gene, strand, start, end))
                all_cds[chrom].append((start-1, end+1))
        elif annot_type == "stop_codon":
            # 'gene' is carried over from the preceding CDS/UTR line of the same transcript.
            if start != end:
                if strand == "+":
                    end = end+6
                elif strand == "-":
                    start = start-6
                if chrom not in all_genes:
                    all_genes[chrom] = []
                if chrom not in all_cds:
                    all_cds[chrom] = []
                all_genes[chrom].append((gene, strand, start, end))
                all_cds[chrom].append((start-1, end+1))
        elif annot_type == "utr":
            tran = ((desc.split('transcript_id "')[1]).split('"')[0]).split(".")[0]
            gene = ((desc.split('gene_name "')[1]).split('"')[0]).split(".")[0]
            if chrom in tran_dict:
                if tran in tran_dict[chrom]:
                    tran_dict[chrom][tran]["UTR"].append((start, end))
                else:
                    tran_dict[chrom][tran] = {"CDS": [], "UTR": [(start, end)], "STRAND": strand, "GENE": gene}
            else:
                tran_dict[chrom] = {tran: {"CDS": [], "UTR": [(start, end)], "STRAND": strand, "GENE": gene}}
            if start != end:
                try:
                    all_utr[chrom].append((start, end))
                except KeyError:
                    all_utr[chrom] = [(start, end)]
        else:
            if start != end:
                try:
                    all_other[chrom].append((start, end))
                except KeyError:
                    all_other[chrom] = [(start, end)]
gtffile.close()

# For each coding transcript populate gene_dict with information such as cds co-ordinates, strand, readthrough length.
for chrom in tran_dict:
    for transcript in tran_dict[chrom]:
        if tran_dict[chrom][transcript]["CDS"] != []:
            total_length = 0
            strand = tran_dict[chrom][transcript]["STRAND"]
            gene = tran_dict[chrom][transcript]["GENE"]
            cds_point = tran_dict[chrom][transcript]["CDS"][0][0]
            utrs = tran_dict[chrom][transcript]["UTR"]
            for tup in utrs:
                total_length += (tup[1] - tup[0])+1
            for tup in tran_dict[chrom][transcript]["CDS"]:
                total_length += (tup[1] - tup[0])+1
            three_trailers = []
            fixed_three_trailers = []
            three_trailer_len = 0
            fixed_cds = []
            cds_len = 0
            if strand == "+":
                for tup in utrs:
                    if tup[0] > cds_point:
                        three_trailers.append(tup)
                sorted_three_trailers = sorted(three_trailers, key=lambda x: x[0])
                # Append one to the end of each interval (this is because interval trees are non-inclusive).
                for tup in sorted_three_trailers[1:]:
                    fixed_three_trailers.append((tup[0], tup[1]+1))
                    three_trailer_len += abs(tup[1]-tup[0])+1
                if sorted_three_trailers != []:
                    fixed_three_trailers.append((sorted_three_trailers[0][0]+3, sorted_three_trailers[0][1]+1))
                    three_trailer_len += abs(sorted_three_trailers[0][0] - sorted_three_trailers[0][1])+1
                sorted_cds = sorted(tran_dict[chrom][transcript]["CDS"], key=lambda x: x[0])
                for tup in sorted_cds[:-1]:
                    cds_len += abs(tup[1]-tup[0])+1
                    fixed_cds.append((tup[0], tup[1]+1))
                if sorted_cds != []:
                    fixed_cds.append((sorted_cds[-1][0], sorted_cds[-1][1]+4))
                    cds_len += abs(sorted_cds[-1][0]-sorted_cds[-1][1])+1
                genomic_cds_start = sorted_cds[0][0]
                genomic_cds_stop = sorted_cds[-1][1]
                cds_stop = total_length-three_trailer_len
            else:
                for tup in utrs:
                    if tup[0] < cds_point:
                        three_trailers.append(tup)
                sorted_three_trailers = sorted(three_trailers, key=lambda x: x[0])
                # Append one to the end of each interval (this is because interval trees are non-inclusive).
                for tup in sorted_three_trailers[:-1]:
                    fixed_three_trailers.append((tup[0]-2, tup[1]))
                    three_trailer_len += abs(tup[1]-tup[0])+1
                if sorted_three_trailers != []:
                    fixed_three_trailers.append((sorted_three_trailers[-1][0]-2, sorted_three_trailers[-1][1]-3))
                    three_trailer_len += abs(sorted_three_trailers[-1][0]-sorted_three_trailers[-1][1])+1
                sorted_cds = sorted(tran_dict[chrom][transcript]["CDS"], key=lambda x: x[0])
                for tup in sorted_cds[1:]:
                    cds_len += abs(tup[1]-tup[0])+1
                    fixed_cds.append((tup[0]-1, tup[1]))
                if sorted_cds != []:
                    fixed_cds.append((sorted_cds[0][0]-4, sorted_cds[0][1]))
                    cds_len += abs(sorted_cds[0][0]-sorted_cds[0][1])+1
                genomic_cds_start = sorted_cds[-1][1]
                genomic_cds_stop = sorted_cds[0][0]
                cds_stop = total_length - three_trailer_len
            if chrom not in gene_dict:
                gene_dict[chrom] = {}
            if gene not in gene_dict[chrom]:
                gene_dict[chrom][gene] = {}
            gene_dict[chrom][gene][transcript] = {"CDS": fixed_cds,
                                                  "3UTR": fixed_three_trailers,
                                                  "STRAND": strand,
                                                  "LENGTH": total_length,
                                                  "THREE_TRAILER_LEN": three_trailer_len,
                                                  "CDS_LEN": cds_len,
                                                  "CDS_STOP": cds_stop,
                                                  "GENOMIC_CDS_START": genomic_cds_start,
                                                  "GENOMIC_CDS_STOP": genomic_cds_stop}

# Given a nucleotide sequence returns its complement. Callers pass the reversed sequence
# to obtain the reverse complement. Bases are swapped via lowercase intermediates so that
# an already-replaced base is not replaced a second time.
def get_comp_seq(inseq):
    upseq = inseq.upper()
    lowseq = upseq.replace("A", "t").replace("T", "a").replace("G", "c").replace("C", "g")
    return lowseq.upper()

# Find the readthrough co-ordinates for all transcripts.
all_stops = ["TAG", "TAA", "TGA"]
for chrom in gene_dict:
    if chrom not in fasta_dict:
        print("Skipping chrom {} as it is not present in the fasta file".format(chrom))
        continue
    for gene in gene_dict[chrom]:
        for tran in gene_dict[chrom][gene]:
            readthrough_intron = False
            minusone_intron = False
            plusone_intron = False
            strand = gene_dict[chrom][gene][tran]["STRAND"]
            three_trailers = gene_dict[chrom][gene][tran]["3UTR"]
            three_trailers = sorted(three_trailers, key=lambda x: x[0])
            seq = ""
            for tup in three_trailers:
                tup_seq = fasta_dict[chrom][tup[0]-1:tup[1]-1]
                seq += tup_seq
            if strand == "-":
                seq = get_comp_seq(seq[::-1])
            fixed_seq = seq[3:]
            readthrough_len = 0
            readthrough_coords = []
            # Walk the trailer codon by codon until the first in-frame stop.
            for i in range(0, len(fixed_seq), 3):
                codon = fixed_seq[i:i+3]
                readthrough_len += 3
                if codon in all_stops:
                    break
            temp_readthrough_len = readthrough_len
            if strand == "-":
                three_trailers = three_trailers[::-1]
            for tup in three_trailers:
                tup_len = tup[1] - tup[0]
                if temp_readthrough_len > tup_len:
                    # The readthrough region continues past this exon, i.e. it spans an intron.
                    readthrough_intron = True
                    temp_readthrough_len -= tup_len
                    if strand == "+":
                        readthrough_coords.append((tup[0]+3, tup[1]))
                    elif strand == "-":
                        readthrough_coords.append((tup[1], tup[0]-1))
                else:
                    if readthrough_intron == False:
                        if strand == "+":
                            readthrough_coords.append((tup[0], tup[0]+readthrough_len+2))
                        elif strand == "-":
                            readthrough_coords.append(((tup[1]-readthrough_len)-3, tup[1]-4))
                    else:
                        if strand == "+":
                            readthrough_coords.append((tup[0], tup[0]+readthrough_len+2))
                        if strand == "-":
                            readthrough_coords.append(((tup[1]-readthrough_len)-3, tup[1]-1))
                    break
            gene_dict[chrom][gene][tran]["READTHROUGH_COORDINATES"] = readthrough_coords
            gene_dict[chrom][gene][tran]["READTHROUGH_LEN"] = readthrough_len

# Create an interval tree for each annotation type (i.e. cds, utr, or other) for every chromosome,
# and store these trees in tree_dict.
# Interval trees are created to allow for rapidly checking if a given riboseq peak overlaps with a CDS region.
for key in all_cds:
    tree_dict[key] = {"CDS": IntervalTree([Interval(-1, 0)]),
                      "UTR": IntervalTree([Interval(-1, 0)]),
                      "OTHER": IntervalTree([Interval(-1, 0)])}
    tree = IntervalTree.from_tuples(all_cds[key])
    tree_dict[key]["CDS"] = tree
for key in all_utr:
    # Chromosomes without any CDS have no entry in tree_dict and are skipped.
    if key in tree_dict:
        tree = IntervalTree.from_tuples(all_utr[key])
        tree_dict[key]["UTR"] = tree
for key in all_other:
    if key in tree_dict:
        tree = IntervalTree.from_tuples(all_other[key])
        tree_dict[key]["OTHER"] = tree

# Parse the bedfile and put positions and counts in footprint_dict.
for line in bedfile:
    splitline = line.split("\t")
    chrom = splitline[0]
    start = int(splitline[1])
    # The majority of counts will be in integer format but some are in scientific notation, which int() will fail to parse.
    try:
        count = int(splitline[3].replace("\n", ""))
    except ValueError:
        count = float(splitline[3].replace("\n", ""))
    if chrom not in footprint_dict:
        footprint_dict[chrom] = []
    footprint_dict[chrom].append((start, count))
bedfile.close()

# For each count in footprint_dict use the interval trees to check if it overlaps with a CDS, UTR, OTHER or
# INTERGENIC region and give it the corresponding label in annotated_footprint_dict.
for chrom in footprint_dict:
    if chrom not in tree_dict.keys():
        continue
    annotated_footprint_dict[chrom] = {"CDS": {}, "UTR": {}, "OTHER": {}, "INTERGENIC": {}}
    # Create several lists which reads will be successively placed in if they do not match the current category.
    footprint_list = footprint_dict[chrom]
    subfootprint_list = []
    subtwofootprint_list = []
    subthreefootprint_list = []
    for tup in footprint_list:
        position, count = tup
        if tree_dict[chrom]["CDS"].overlaps(position):
            annotated_footprint_dict[chrom]["CDS"][position] = count
        else:
            subfootprint_list.append(tup)
    for tup in subfootprint_list:
        position, count = tup
        if tree_dict[chrom]["UTR"].overlaps(position):
            annotated_footprint_dict[chrom]["UTR"][position] = count
        else:
            subtwofootprint_list.append(tup)
    for tup in subtwofootprint_list:
        position, count = tup
        if tree_dict[chrom]["OTHER"].overlaps(position):
            annotated_footprint_dict[chrom]["OTHER"][position] = count
        else:
            subthreefootprint_list.append(tup)
    for tup in subthreefootprint_list:
        position, count = tup
        annotated_footprint_dict[chrom]["INTERGENIC"][position] = count

# For each gene find the highest riboseq peak between the annotated stop and next in-frame stop
# that does not overlap with another annotated CDS region.
for chrom in gene_dict:
    if chrom not in annotated_footprint_dict.keys():
        print("Chrom {} is not in annotated_footprint_dict, skipping".format(chrom))
        continue
    for gene in gene_dict[chrom]:
        # For cases with multiple transcripts only pick the one with the longest 3' trailer,
        # unless the transcripts have different annotated stop codons.
        accepted_trans = {}
        genomic_stops = []
        for tran in gene_dict[chrom][gene]:
            genomic_cds_stop = gene_dict[chrom][gene][tran]["GENOMIC_CDS_STOP"]
            three_utr_len = gene_dict[chrom][gene][tran]["THREE_TRAILER_LEN"]
            if genomic_cds_stop in accepted_trans:
                if three_utr_len > accepted_trans[genomic_cds_stop][0]:
                    accepted_trans[genomic_cds_stop] = [three_utr_len, tran]
            else:
                accepted_trans[genomic_cds_stop] = [three_utr_len, tran]
        accepted_tran_list = []
        for key in accepted_trans:
            accepted_tran_list.append(accepted_trans[key][1])
        # For all accepted transcripts find the highest riboseq peak in the readthrough co-ordinates
        # using counts that have been annotated as UTR.
        for tran in accepted_tran_list:
            temp_dict = {}
            if tran not in gene_dict[chrom][gene]:
                print("tran not in gene_dict")
                continue
            if "READTHROUGH_COORDINATES" not in gene_dict[chrom][gene][tran]:
                print("skipping transcript {} for gene {}".format(tran, gene))
                continue
            readthrough_coords = gene_dict[chrom][gene][tran]["READTHROUGH_COORDINATES"]
            strand = gene_dict[chrom][gene][tran]["STRAND"]
            for tup in gene_dict[chrom][gene][tran]["3UTR"]:
                for i in range(tup[0], tup[1]):
                    if i in annotated_footprint_dict[chrom]["UTR"]:
                        temp_dict[i] = annotated_footprint_dict[chrom]["UTR"][i]
            max_rt = 0
            max_rt_pos = 0
            for tup in readthrough_coords:
                for i in range(tup[0], tup[1]):
                    if i in temp_dict:
                        if temp_dict[i] > max_rt:
                            max_rt = temp_dict[i]
                            max_rt_pos = i
            # Correct for 0 based co-ordinates
            max_rt_pos = max_rt_pos+1
            # Add this gene to master_dict, unless it has already been added, in which case replace it
            # only if the max peak for this transcript is higher.
            if gene not in master_dict:
                master_dict[gene] = {"chrom": chrom, "position": max_rt_pos, "gene": gene,
                                     "transcript": tran, "max_peak": max_rt}
            elif max_rt > master_dict[gene]["max_peak"]:
                master_dict[gene] = {"chrom": chrom, "position": max_rt_pos, "gene": gene,
                                     "transcript": tran, "max_peak": max_rt}

# Sort the master_dict from highest to lowest max peak count; if max peak count is greater than
# MIN_PEAK_COUNT then write it to outfile.
for gene in sorted(master_dict.keys(), key=lambda x: (master_dict[x]["max_peak"]), reverse=True):
    if master_dict[gene]["max_peak"] > MIN_PEAK_COUNT:
        outfile.write("{},{},{},{},{}\n".format(master_dict[gene]["chrom"],
                                                master_dict[gene]["position"],
                                                master_dict[gene]["transcript"],
                                                gene,
                                                master_dict[gene]["max_peak"]))
outfile.close()
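The core step of the script, scanning a 3' trailer codon by codon for the next in-frame stop, can be sketched in isolation. This is a minimal illustration rather than the authors' implementation: the helper names `reverse_complement` and `readthrough_length` are hypothetical, and unlike the script, which still reports a length when the trailer contains no stop, this sketch returns None in that case.

```python
# Illustrative sketch only; function names are hypothetical and not part of the script.
STOP_CODONS = {"TAG", "TAA", "TGA"}
_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def readthrough_length(trailer):
    """Number of nucleotides from the start of `trailer` up to and including
    the first in-frame stop codon, or None if no in-frame stop is present.

    `trailer` is assumed to start at the first codon after the annotated stop.
    """
    trailer = trailer.upper()
    for i in range(0, len(trailer) - 2, 3):
        if trailer[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


# Plus strand: codons GCT, AAA, TGA. The out-of-frame "TAA" is ignored.
print(readthrough_length("GCTAAATGACCC"))                    # 9
# Minus strand: reverse-complement the genomic sequence first.
print(readthrough_length(reverse_complement("GGGTCAAGC")))   # 6
# No in-frame stop in the trailer.
print(readthrough_length("GCTGCT"))                          # None
```

Returning None for a trailer without a stop makes that case explicit to the caller; whether such transcripts should be kept or discarded is a separate decision that the sketch does not make.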
Guide to individual supplementary items

Supplementary Figure 1.
Title: Source Data (Gels).
Description: Original source images of the gels used for making figures, with weight markers. Cropped parts are indicated.

Supplementary Information
Title: Supplementary Information
Description: Information on (i) oligonucleotides used, (ii) Python code and (iii) a guide to additional supplementary items.

Supplementary Data 1
Title: Genomic alignment of tetrapods from UCSC Genome Browser 100 species alignment.
Description: Codon alignment obtained with CodAlignView; positions of AMD1 stop and AMD1 tail stop are annotated (second row).

Supplementary Data 2
Title: Alignment of AMD1 coding region and surrounding areas from 146 vertebrate species.
Description: Synonymous and nonsynonymous substitutions are indicated by blue and red colours, respectively, and gaps are in grey. Ka/Ks ratio and sequence identity (see Methods) are shown at the bottom.

Supplementary Data 3
Title: Human transcripts with ribosome density profiles similar to AMD1
Description: List of GENCODE transcripts containing peaks of ribosome density downstream and in-frame of protein coding regions. For each transcript, information on the chromosome, coordinates, locus, GENCODE ID and the number of footprints is provided in comma-delimited format.

Supplementary Data 4
Title: Vectors and plasmids
Description: Sequences of vectors and plasmids used in this study in fasta format.

Supplementary Data 5.
Title: Genomic sequences of AMD1 coding regions.
Description: Genomic sequences of AMD1 coding regions for 146 vertebrate species used in this study in fasta format. GenBank IDs for the source sequences are provided in the comment line for each sequence.

Supplementary Data 6.
Title: Ribosome profiling datasets used for GWIPS-viz global aggregate tracks
Description: Datasets are listed on separate sheets for each genome. The first column indicates the publication in which the datasets are described (first author name followed by the year; the full reference can be found in GWIPS-viz). The second column provides GEO or SRA IDs for each individual dataset from the corresponding study.
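Supplementary Data 3 is described as a comma-delimited list, and the Python script writes its results with the header chrom,position,tran,gene,max_peak. A short sketch of loading and filtering such a file follows. It assumes the file carries exactly that header, which may not hold for the published Supplementary Data 3; the function name `load_peaks` is hypothetical, and the rows in the example are made up for illustration and are not results from the study.

```python
import csv
import io


def load_peaks(handle, min_peak=500):
    """Read rows with the header chrom,position,tran,gene,max_peak and return
    those whose max_peak exceeds `min_peak`, highest peak first."""
    rows = []
    for row in csv.DictReader(handle):
        # float() accepts both plain integers and scientific notation.
        peak = float(row["max_peak"])
        if peak > min_peak:
            rows.append({
                "chrom": row["chrom"],
                "position": int(row["position"]),
                "tran": row["tran"],
                "gene": row["gene"],
                "max_peak": peak,
            })
    rows.sort(key=lambda r: r["max_peak"], reverse=True)
    return rows


# Made-up rows for illustration only; these are not data from the study.
example = io.StringIO(
    "chrom,position,tran,gene,max_peak\n"
    "chrA,1200,TRAN_A,GENE_A,750\n"
    "chrB,3400,TRAN_B,GENE_B,1.2e3\n"
    "chrC,5600,TRAN_C,GENE_C,40\n"
)
peaks = load_peaks(example)
print([p["gene"] for p in peaks])  # ['GENE_B', 'GENE_A']
```

For a real file, the same function can be called with an open file handle in place of the in-memory example.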
Fusion Detection Using QIAseq RNAscan Panels June 11, 2018 Sample to Insight QIAGEN Aarhus Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.qiagenbioinformatics.com ts-bioinformatics@qiagen.com
More informationPart 1: How to use IGV to visualize variants
Using IGV to identify true somatic variants from the false variants http://www.broadinstitute.org/igv A FAQ, sample files and a user guide are available on IGV website If you use IGV in your publication:
More informationQIAseq DNA V3 Panel Analysis Plugin USER MANUAL
QIAseq DNA V3 Panel Analysis Plugin USER MANUAL User manual for QIAseq DNA V3 Panel Analysis 1.0.1 Windows, Mac OS X and Linux January 25, 2018 This software is for research purposes only. QIAGEN Aarhus
More informationFrom genomic regions to biology
Before we start: 1. Log into tak (step 0 on the exercises) 2. Go to your lab space and create a folder for the class (see separate hand out) 3. Connect to your lab space through the wihtdata network and
More informationUploading sequences to GenBank
A primer for practical phylogenetic data gathering. Uconn EEB3899-007. Spring 2015 Session 5 Uploading sequences to GenBank Rafael Medina (rafael.medina.bry@gmail.com) Yang Liu (yang.liu@uconn.edu) confirmation
More informationGenomics - Problem Set 2 Part 1 due Friday, 1/25/2019 by 9:00am Part 2 due Friday, 2/1/2019 by 9:00am
Genomics - Part 1 due Friday, 1/25/2019 by 9:00am Part 2 due Friday, 2/1/2019 by 9:00am One major aspect of functional genomics is measuring the transcript abundance of all genes simultaneously. This was
More informationTutorial: chloroplast genomes
Tutorial: chloroplast genomes Stacia Wyman Department of Computer Sciences Williams College Williamstown, MA 01267 March 10, 2005 ASSUMPTIONS: You are using Internet Explorer under OS X on the Mac. You
More informationSupplementary Material. Cell type-specific termination of transcription by transposable element sequences
Supplementary Material Cell type-specific termination of transcription by transposable element sequences Andrew B. Conley and I. King Jordan Controls for TTS identification using PET A series of controls
More information4.1. Access the internet and log on to the UCSC Genome Bioinformatics Web Page (Figure 1-
1. PURPOSE To provide instructions for finding rs Numbers (SNP database ID numbers) and increasing sequence length by utilizing the UCSC Genome Bioinformatics Database. 2. MATERIALS 2.1. Sequence Information
More informationIntroduction to Genome Browsers
Introduction to Genome Browsers Rolando Garcia-Milian, MLS, AHIP (Rolando.milian@ufl.edu) Department of Biomedical and Health Information Services Health Sciences Center Libraries, University of Florida
More informationGenomics - Problem Set 2 Part 1 due Friday, 1/26/2018 by 9:00am Part 2 due Friday, 2/2/2018 by 9:00am
Genomics - Part 1 due Friday, 1/26/2018 by 9:00am Part 2 due Friday, 2/2/2018 by 9:00am One major aspect of functional genomics is measuring the transcript abundance of all genes simultaneously. This was
More informationTutorial: Jump Start on the Human Epigenome Browser at Washington University
Tutorial: Jump Start on the Human Epigenome Browser at Washington University This brief tutorial aims to introduce some of the basic features of the Human Epigenome Browser, allowing users to navigate
More informationUSING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)
USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are
More informationA short Introduction to UCSC Genome Browser
A short Introduction to UCSC Genome Browser Elodie Girard, Nicolas Servant Institut Curie/INSERM U900 Bioinformatics, Biostatistics, Epidemiology and computational Systems Biology of Cancer 1 Why using
More informationIntroduction to Bioinformatics Problem Set 3: Genome Sequencing
Introduction to Bioinformatics Problem Set 3: Genome Sequencing 1. Assemble a sequence with your bare hands! You are trying to determine the DNA sequence of a very (very) small plasmids, which you estimate
More informationComputational Molecular Biology
Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive
More informationUser's guide to ChIP-Seq applications: command-line usage and option summary
User's guide to ChIP-Seq applications: command-line usage and option summary 1. Basics about the ChIP-Seq Tools The ChIP-Seq software provides a set of tools performing common genome-wide ChIPseq analysis
More informationTn-seq Explorer 1.2. User guide
Tn-seq Explorer 1.2 User guide 1. The purpose of Tn-seq Explorer Tn-seq Explorer allows users to explore and analyze Tn-seq data for prokaryotic (bacterial or archaeal) genomes. It implements a methodology
More informationMinimum Information for Reporting Immunogenomic NGS Genotyping (MIRING)
Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) Reporting guideline statement for HLA and KIR genotyping data generated via Next Generation Sequencing (NGS) technologies and analysis
More informationGenome Browsers Guide
Genome Browsers Guide Take a Class This guide supports the Galter Library class called Genome Browsers. See our Classes schedule for the next available offering. If this class is not on our upcoming schedule,
More informationHIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)
HIPPIE User Manual (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu) OVERVIEW OF HIPPIE o Flowchart of HIPPIE o Requirements PREPARE DIRECTORY STRUCTURE FOR HIPPIE EXECUTION o
More informationCLC Server. End User USER MANUAL
CLC Server End User USER MANUAL Manual for CLC Server 10.0.1 Windows, macos and Linux March 8, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej 2 Prismet DK-8000 Aarhus C Denmark
More informationGenome Environment Browser (GEB) user guide
Genome Environment Browser (GEB) user guide GEB is a Java application developed to provide a dynamic graphical interface to visualise the distribution of genome features and chromosome-wide experimental
More informationComputational Theory MAT542 (Computational Methods in Genomics) - Part 2 & 3 -
Computational Theory MAT542 (Computational Methods in Genomics) - Part 2 & 3 - Benjamin King Mount Desert Island Biological Laboratory bking@mdibl.org Overview of 4 Lectures Introduction to Computation
More informationBiocomputing II Coursework guidance
Biocomputing II Coursework guidance I refer to the database layer as DB, the middle (business logic) layer as BL and the front end graphical interface with CGI scripts as (FE). Standardized file headers
More informationPackage igc. February 10, 2018
Type Package Package igc February 10, 2018 Title An integrated analysis package of Gene expression and Copy number alteration Version 1.8.0 This package is intended to identify differentially expressed
More informationGetting Started. April Strand Life Sciences, Inc All rights reserved.
Getting Started April 2015 Strand Life Sciences, Inc. 2015. All rights reserved. Contents Aim... 3 Demo Project and User Interface... 3 Downloading Annotations... 4 Project and Experiment Creation... 6
More informationProgramming Applications. What is Computer Programming?
Programming Applications What is Computer Programming? An algorithm is a series of steps for solving a problem A programming language is a way to express our algorithm to a computer Programming is the
More informationIntroduction to Galaxy
Introduction to Galaxy Dr Jason Wong Prince of Wales Clinical School Introductory bioinformatics for human genomics workshop, UNSW Day 1 Thurs 28 th January 2016 Overview What is Galaxy? Description of
More informationPractical Course in Genome Bioinformatics
Practical Course in Genome Bioinformatics 20/01/2017 Exercises - Day 1 http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2017/ Answer questions Q1-Q3 below and include requested Figures 1-5
More informationTutorial MAJIQ/Voila (v1.1.x)
Tutorial MAJIQ/Voila (v1.1.x) Introduction What are MAJIQ and Voila? What is MAJIQ? What MAJIQ is not What is Voila? How to cite us? Quick start Pre MAJIQ MAJIQ Builder Outlier detection PSI Analysis Delta
More informationVectorBase Web Apollo April Web Apollo 1
Web Apollo 1 Contents 1. Access points: Web Apollo, Genome Browser and BLAST 2. How to identify genes that need to be annotated? 3. Gene manual annotations 4. Metadata 1. Access points Web Apollo tool
More informationTutorial. Find Very Low Frequency Variants With QIAGEN GeneRead Panels. Sample to Insight. November 21, 2017
Find Very Low Frequency Variants With QIAGEN GeneRead Panels November 21, 2017 Sample to Insight QIAGEN Aarhus Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.qiagenbioinformatics.com
More informationDr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata
Analysis of RNA sequencing data sets using the Galaxy environment Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Microarray and Deep-sequencing core facility 30.10.2017 RNA-seq workflow I Hypothesis
More informationPython review. 1 Python basics. References. CS 234 Naomi Nishimura
Python review CS 234 Naomi Nishimura The sections below indicate Python material, the degree to which it will be used in the course, and various resources you can use to review the material. You are not
More informationHandling sam and vcf data, quality control
Handling sam and vcf data, quality control We continue with the earlier analyses and get some new data: cd ~/session_3 wget http://wasabiapp.org/vbox/data/session_4/file3.tgz tar xzf file3.tgz wget http://wasabiapp.org/vbox/data/session_4/file4.tgz
More informationTutorial. RNA-Seq Analysis of Breast Cancer Data. Sample to Insight. November 21, 2017
RNA-Seq Analysis of Breast Cancer Data November 21, 2017 Sample to Insight QIAGEN Aarhus Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.qiagenbioinformatics.com AdvancedGenomicsSupport@qiagen.com
More informationPICS: Probabilistic Inference for ChIP-Seq
PICS: Probabilistic Inference for ChIP-Seq Xuekui Zhang * and Raphael Gottardo, Arnaud Droit and Renan Sauteraud April 30, 2018 A step-by-step guide in the analysis of ChIP-Seq data using the PICS package
More informationFinding Selection in All the Right Places TA Notes and Key Lab 9
Objectives: Finding Selection in All the Right Places TA Notes and Key Lab 9 1. Use published genome data to look for evidence of selection in individual genes. 2. Understand the need for DNA sequence
More informationTutorial. Small RNA Analysis using Illumina Data. Sample to Insight. October 5, 2016
Small RNA Analysis using Illumina Data October 5, 2016 Sample to Insight QIAGEN Aarhus Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.qiagenbioinformatics.com AdvancedGenomicsSupport@qiagen.com
More informationASAP - Allele-specific alignment pipeline
ASAP - Allele-specific alignment pipeline Jan 09, 2012 (1) ASAP - Quick Reference ASAP needs a working version of Perl and is run from the command line. Furthermore, Bowtie needs to be installed on your
More informationBovineMine Documentation
BovineMine Documentation Release 1.0 Deepak Unni, Aditi Tayal, Colin Diesh, Christine Elsik, Darren Hag Oct 06, 2017 Contents 1 Tutorial 3 1.1 Overview.................................................
More informationSmall RNA Analysis using Illumina Data
Small RNA Analysis using Illumina Data September 7, 2016 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com support-clcbio@qiagen.com
More informationGenetics 211 Genomics Winter 2014 Problem Set 4
Genomics - Part 1 due Friday, 2/21/2014 by 9:00am Part 2 due Friday, 3/7/2014 by 9:00am For this problem set, we re going to use real data from a high-throughput sequencing project to look for differential
More informationSequence Analysis Pipeline
Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation
More information