Genome representa;on concepts. Week 12, Lecture 24. Coordinate systems. Genomic coordinates brief overview 11/13/14

Similar documents
Bioinformatics in next generation sequencing projects

A short Introduction to UCSC Genome Browser

Advanced UCSC Browser Functions

ChIP-seq (NGS) Data Formats

The BEDTools manual. Aaron R. Quinlan and Ira M. Hall University of Virginia. Contact:

Integra(ve Genomics Viewer IGV. Tom Carroll MRC Clinical Sciences Centre

NGS Analysis Using Galaxy

The BEDTools manual. Last updated: 21-September-2010 Current as of BEDTools version Aaron R. Quinlan and Ira M. Hall University of Virginia

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

- 1 - Web page:

Galaxy Platform For NGS Data Analyses

Goal: Learn how to use various tool to extract information from RNAseq reads.

Week 9, Lecture 17. Some Programming Required. Programming Languages. Turing Completeness 10/20/15

Genetics 211 Genomics Winter 2014 Problem Set 4

Services Performed. The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples.

Practical Bioinformatics for Life Scientists. Week 4, Lecture 8. István Albert Bioinformatics Consulting Center Penn State

NGS Data Visualization and Exploration Using IGV

HymenopteraMine Documentation

User's guide to ChIP-Seq applications: command-line usage and option summary

BovineMine Documentation

Genomics tools: making quickly impressive outputs

Genome Browsers - The UCSC Genome Browser

Short Read Sequencing Analysis Workshop

Analyzing Variant Call results using EuPathDB Galaxy, Part II

CLC Server. End User USER MANUAL

GROK User Guide August 28, 2013

Part 1: How to use IGV to visualize variants

The UCSC Gene Sorter, Table Browser & Custom Tracks

panda Documentation Release 1.0 Daniel Vera

Finding and Exporting Data. BioMart

How to use earray to create custom content for the SureSelect Target Enrichment platform. Page 1

Relationship Between BED and WIG Formats

Design and Annotation Files

RNA-Seq Analysis With the Tuxedo Suite

Genome Browsers Guide

Read Mapping and Variant Calling

Eval: A Gene Set Comparison System

Short Read Sequencing Analysis Workshop

Sequence Alignment: Mo1va1on and Algorithms. Lecture 2: August 23, 2012

Welcome to GenomeView 101!

Today's outline. Resources. Genome browser components. Genome browsers: Discovering biology through genomics. Genome browser tutorial materials

Handling important NGS data formats in UNIX Prac8cal training course NGS Workshop in Nove Hrady 2014

CISC327 - So*ware Quality Assurance

m6aviewer Version Documentation

Advanced genome browsers: Integrated Genome Browser and others Heiko Muller Computational Research

merantk Version 1.1.1a

Roadmap. The Interface CSE351 Winter Frac3onal Binary Numbers. Today s Topics. Floa3ng- Point Numbers. What is ?

Briefly: Bioinformatics File Formats. J Fass September 2018

ChIP-seq hands-on practical using Galaxy

Useful software utilities for computational genomics. Shamith Samarajiwa CRUK Autumn School in Bioinformatics September 2017

Exercise 2: Browser-Based Annotation and RNA-Seq Data

Introduction to Galaxy

Genomic Analysis with Genome Browsers.

Rsubread package: high-performance read alignment, quantification and mutation discovery

GEP Project Management System: Annotation Project Submission

CS101: Fundamentals of Computer Programming. Dr. Tejada www-bcf.usc.edu/~stejada Week 1 Basic Elements of C++

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Floa.ng Point : Introduc;on to Computer Systems 4 th Lecture, Sep. 10, Instructors: Randal E. Bryant and David R.

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data

MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September

Rsubread package: high-performance read alignment, quantification and mutation discovery

RNAseqViewer Documentation

From genomic regions to biology

Applications of a generic model of genomic variations functional analysis

Tutorial: Building up and querying databases

Tutorial: Jump Start on the Human Epigenome Browser at Washington University

Long Read RNA-seq Mapper

Fusion Detection Using QIAseq RNAscan Panels

Package Rsubread. July 21, 2013

4.1. Access the internet and log on to the UCSC Genome Bioinformatics Web Page (Figure 1-

Objec0ves. Gain understanding of what IDA Pro is and what it can do. Expose students to the tool GUI

Introduc)on to annota)on with Artemis. Download presenta.on and data

Tutorial 1: Exploring the UCSC Genome Browser

Sequence Alignment: Mo1va1on and Algorithms

Visualizing RNA- Seq Differen5al Expression Results with CummeRbund

ChIP-Seq Tutorial on Galaxy

Overview. Dataset: testpos DNA: CCCATGGTCGGGGGGGGGGAGTCCATAACCC Num exons: 2 strand: + RNA (from file): AUGGUCAGUCCAUAA peptide (from file): MVSP*

Read Mapping. Slides by Carl Kingsford

Creating and Using Genome Assemblies Tutorial

SPAR outputs and report page

Package refgenome. March 20, 2017

Miniproject 1. Part 1 Due: 16 February. The coverage problem. Method. Why it is hard. Data. Task1

Package GenomicFeatures

Integrative Genomics Viewer. Prat Thiru

For Research Use Only. Not for use in diagnostic procedures.

Table of contents Genomatix AG 1

GenomeStudio Software Release Notes

Practical Course in Genome Bioinformatics

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

Bedtools Documentation

David Crossman, Ph.D. UAB Heflin Center for Genomic Science. GCC2012 Wednesday, July 25, 2012

Tiling Assembly for Annotation-independent Novel Gene Discovery

Today s Class. High Dimensional Data & Dimensionality Reduc8on. Readings for This Week: Today s Class. Scien8fic Data. Misc. Personal Data 2/22/12

NGS Data and Sequence Alignment

SAM and VCF formats. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

Subread/Rsubread Users Guide

High-throughout sequencing and using short-read aligners. Simon Anders

Design and Debug: Essen.al Concepts Numerical Conversions CS 16: Solving Problems with Computers Lecture #7

2) NCBI BLAST tutorial This is a users guide written by the education department at NCBI.

Our Task At Hand Aggregate data from every group

Transcription:

2014 - BMMB 852D: Applied Bioinforma;cs Week 12, Lecture 24 István Albert Biochemistry and Molecular Biology and Bioinforma;cs Consul;ng Center Penn State Genome representa;on concepts At the simplest level of abstrac;on the genome is represented by a one dimensional space (lines) Genome is two stranded à a line corresponds to each strand Each strand has a polarity à each line has a direc;on Strands (lines) are paired The smallest unit is one base à one integer on the number line Annota;ons (features) are segments (coordinates) on each line Genomic coordinates brief overview upstream for the forward strand DNA two stranded and direc;onal But there is only one coordinate system 5 3 3 200 300 upstream for the reverse strand Standard formats use start < end even for the reverse strand 5 Coordinate systems 0 based à 0, 1, 2, 9 1 based à 1, 2, 3, 10 Typically 0 based are non- inclusive 10:20 à [ 10, 20 ) 1 based include both ends 10:20 à [ 10, 20 ] The upstream region before the 5 end rela;ve to the direc;on of transcrip;on 1

Comparing coordinate systems Fundamental interval formats Third element First ten Second ten Third ten One base long interval star;ng at 10 1 based indexing 0 based indexing SAM/BAM Sequence Alignment Map VCF/BCF à for variant calls Length of an interval BED/GFF à Gene Annota;on representa;on Five elements star;ng at 1000 Empty interval Vote for what you think is bezer BEDgraph, Wiggle à values over intervals What is a genomic feature? Feature: a genomic region (interval) associated with a certain annota;on (descrip;on). 1. chromosome 2. start 3. end 4. strand 5. name Typical azributes to describe a feature Values on intervals A single value characterizes an en;re interval à score (value) for the interval Con;nuous values à different value for each base of the interval à analogous to a series of 1bp long intervals Different data representa;on formats Why do we have so many variants? There is no good ra;onal reason history I guess 2

hzp://genome.ucsc.edu/faq/faqformat.html Two commonly used formats BED UCSC genome browser à 0 based non inclusive à also used to display tracks in the genome browser (US standard ) (variants: bigbed, bedgraph) GFF Sanger ins;tute in Great Britain à 1 based inclusive indexing system ( European standard ), (variants: GTF, GFF 2.0) BED format Search for BED format Tab separated 3 required and 9 op;onal columns. Lower numbered filed must be filled. 1. chrom (name of the chromosome, sequence id) 2. chromstart (star;ng posi;on on the chromosome) 3. chromend (end posi;on of the chromosome, note this base is not included!) 4. name (feature name) 5. score (between 0 and 1000) 6. strand (+ or - ) 7. thickstart (the star;ng posi;on at which the feature is drawn thickly) 8. thickend (the ending posi;on at which the feature is drawn thickly) 9. itemrgb (RGB color à 255, 0, 0 display color of the data contained) 10. blockcount (the number of blocks (exons) in the BED line.) 11. blocksizes (a comma- separated list of the block sizes) 12. blockstarts (a comma- separated list of the block starts) Tab separated 4 required columns. BedGraph Format 1. chrom (name of the chromosome, sequence id) 2. chromstart (star;ng posi;on on the chromosome) 3. chromend (end posi;on of the chromosome, note this base is not included!) 4. datavalue (value of the data for that region) These files may also take a track defini;on line that is visualiza;on specific 3

GFF format Search for GFF3 à hzp://www.sequenceontology.org/gff3.shtml Tab separated with 9 columns. Missing azributes may be replaced with a dot à. 1. Seqid (usually chromosome) 2. Source (where is the data coming from) 3. Type (usually a term from the sequence ontology) 4. Start (interval start rela;ve to the seqid) 5. End (interval end rela;ve to the seqid) 6. Score (the score of the feature, a floa;ng point number) 7. Strand (+ or ) 8. Phase (used to indicate reading frame for coding sequences) 9. AZributes (semicolon separated azributes à Name=ABC;ID=1) people like to stuff a lot of informa;on here Wiggle format two versions à fixed step and variable step each trying to op;mize the amount of data storage fixedstep chrom=chr1 start=100 step=1 10 15 variablestep chrom=chr1 11 100 10 22 101 15 102 11 103 22 variablestep chrom=chr2 2000 23 2005 40 Wiggle is an nasty format it looks simpler than it is please avoid We may have data in different coordinate systems! Being one off is one of the most common errors in bioinforma;cs. Conversion from GFF to BED (start, end) à (start 1, end) Conversion from BED to GFF (start, end) à (start + 1, end) Not that there will be differences when selecang posiaons that depend on the END coordinate! Handling coordinates rela;ve to intervals What are the coordinate of the base preceding and following the interval GFF [start, end] à preceding base is at start - 1 BED [start, end) à preceding base is at start - 1 GFF [start, end] à next base is at end + 1 BED [start, end) à next base is at end 4

Represen;ng interval rela;onships We have a gene with three splicing variants Data representa;on Both BED and GFF files can represent them Two common versions of GFF à GTF2 and GFF3 (note: tool documenta.on can o/en wrong and shows a weird combina.on of these two formats) Start at 1000 ends at 8000, each exon is 1kb and is separated by 1kb How to represent this in a data format? In GFF the content of the ATTRIBUTE (9 th ) column specifies the rela;onship between features GTF azributes: GTF/GFF formats Example interval as GTF gene_id value; a globally unique iden.fier for the genomic source of the transcript transcript_id value a globally unique iden.fier for the predicted transcript. GFF azributes: gene_id G1 transcript_id T1 ID=exon1; Parent=T1 See the GFF3 site for exact specifica;on of the these mean. Important: More than one parent may be listed! A dis;nct line is entered for each exon, repeated for each transcript 5

Example interval as GFF 3 Example interval in BED The same exon may be part of different transcripts (parents) From the BED format specifica;on Visualizing in IGV Homework 24 Create and visualize in IGV an interval file that contains three splice variants of a 1kb long gene with 5 exons. Show the file and a screenshot 6