TCGA Variant Call Format (VCF) 1.0 Specification

Document Information
Specification for TCGA Variant Call Format (VCF) Version

Contents
1 About TCGA VCF specification
2 TCGA-specific customizations
3 File format
  3.1 HEADER
    Generic meta-information
    INFO/FORMAT/FILTER meta-information
    TCGA-specific meta-information
    Column header meta-information
  3.2 BODY
    Variant records
4 Extensions for TCGA data
  4.1 Structural variants
  4.2 Complex rearrangements
  4.3 RNA-Seq variants
5 Validation rules
  5.1 Handling failed checks
  5.2 Test files

About TCGA VCF specification

Variant Call Format (VCF) is a format for storing and reporting genomic sequence variations. VCF files are modular: the annotations and genotype information for a variant are kept separate from the call itself. As of May 2011, VCF version 4.1 (described here) is the most recent release. Genome Sequencing Centers (GSCs) will generate sequence variation data using high-throughput sequencing technologies, and the resulting variations will be submitted to the Data Coordinating Center (DCC) as VCF files. TCGA has adopted VCF 4.1 with certain modifications to support supplemental information specific to the project. Subsequent sections describe the format TCGA VCF files should follow and the validation steps that have to be implemented at the DCC.

TCGA-specific customizations

The VCF 4.1 specification has been customized to support TCGA-specific variant information. While the majority of the steps pertaining to the basic structure of the file remain the same, checks for supplemental information fields have been introduced. For example, the TCGA VCF specification allows additional fields to represent data associated with complex rearrangements, RNA-Seq variants, and sample-specific metadata. All TCGA-specific additions and modifications in validation steps are prefixed with a <TCGA-VCF> tag for convenient comparison with the 1000Genomes VCF 4.1 validator specification. The following table summarizes TCGA-specific customizations that have been added to the VCF 4.1 specification.
The first column, "Customization type", indicates whether a new validation step has been introduced or an existing step has been modified.

Table 1: TCGA-specific validation steps
Each entry lists the customization type, a description, the validation step # in the TCGA-VCF 1.0 spec and, where given, the corresponding validation step # in the VCF 4.1 spec.

  New. Validate that file contains ##tcgaversion HEADER line. Its presence indicates that the file is TCGA VCF and the value assigned to the field contains format version number.
  New. Additional mandatory header lines (please refer to Table 2). TCGA-VCF step #1; VCF 4.1 step #1.
  New. Validation of SAMPLE meta-information lines. TCGA-VCF step #15.
  New. Validation of PEDIGREE meta-information lines. TCGA-VCF step #16.
  Modification. Acceptable value set for CHROM has been modified. TCGA-VCF step #18a,b; VCF 4.1 step #16a.
  Modification. Acceptable value set for ALT has been modified. TCGA-VCF step #19; VCF 4.1 step #17.
  New. Validation for INFO sub-field "VT" has been added. TCGA-VCF step #22.
  New. Validation for FORMAT sub-field "SS" has been added. TCGA-VCF step #23.
  New. Validation for INFO/FORMAT sub-field "DP" has been added. TCGA-VCF step #24.
  New. Validation for complex rearrangement records has been added. TCGA-VCF step #25.
  New. Validation for RNA-Seq annotation fields has been added. TCGA-VCF step #26.

File format

The following example (based on VCF version 4.1) shows different components of a TCGA VCF file. Any VCF file contains two main sections. The HEADER section contains meta-information for variant records that are reported as individual rows in the BODY of the VCF file. Both sections are described below.

Case-sensitivity: Please note that all fields and their associated validation rules are case-sensitive (as given in the specification) unless noted otherwise.

Figure 1: Components of a sample TCGA VCF file

HEADER

The HEADER contains meta-information lines that provide supplemental information about variants contained in the BODY of the file. HEADER lines can be formatted in either of the following two ways:

  ##key=value
    ##fileformat=vcfv4.1
    ##filedate=

or

  ##FIELDTYPE=<key1=value1,key2=value2,...>
    ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

Meta-information can apply either to all variant records in the file (e.g., date of creation of file) or to individual variants (e.g., flag to indicate whether a given variant exists in dbsnp).

Generic meta-information

Format: ##key=value OR ##FIELDTYPE=<key1=value1,key2=value2,...>

The following table lists some of the reserved field names. Files can be customized to contain additional meta-information fields as long as they do not conflict with reserved field names. The first field in Table 2 (fileformat) is mandatory and lists the VCF version number of the file.

Table 2: Examples of generic meta-information fields
(Columns: Field, Case-sensitive, Description, Sample values, Required; fields in red are TCGA-specific requirements)

fileformat
  Description: Lists the VCF version number the file is based on; must be the first line in the file
  Sample values: ##fileformat=vcfv4.1

filedate
  Description: Date file was created; should be in yyyymmdd format
  Sample values: ##filedate=

tcgaversion
  Description: Indicates that the file follows TCGA-VCF specification. Format version number is assigned to the field.
  Sample values: ##tcgaversion=1.0

reference
  Description: Reference build used for variant calling and against which variant coordinates are shown
  Sample values: ##reference=1000genomespilot-ncbi36 OR ##reference=<id=hg18,Source=file://seq/references/1000genomespilot-ncbi36.fasta>

assembly
  Description: External assembly file
  Sample values: ##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta
  Required: if a contig from an assembly file is being referred to in the VCF file, especially for breakends

center
  Description: Name of the center where VCF file is generated. A comma-separated list can be provided if files from multiple centers are merged.
  Sample values: ##center="broad" OR ##center="broad,ucsc,bcm"

phasing
  Description: Indicates whether genotype calls are partially phased (phasing=partial) or unphased (phasing=none)
  Sample values: ##phasing=none

geneanno
  Description: URL of the gene annotation source, e.g., Generic Annotation File (GAF)
  Sample values: ##geneanno= /GAF_bundle_Feb2011/outputs/TCGA.hg18.Feb2011.gaf
  Required: if annotation tags like GENE, SID and RGN are used

vcfprocesslog
  Description: Lists algorithm, version and settings used to generate variant calls in a VCF file. If multiple VCF files are processed to produce a single merged file, the field records attributes for individual VCF files and the programs used to merge the files, along with the associated version, parameters and contact information of the person who produced the merged file. Note: If the VCF file does not represent a set of merged files, the MergeSoftware, MergeParam, MergeVer and MergeContact tags are not applicable and can be omitted.
  Sample values:
    ##vcfprocesslog=<inputvcf=<file1.vcf>, InputVCFSource=<varCaller1>, InputVCFVer=<1.0>, InputVCFParam=<a1,c2>, InputVCFgeneAnno=<anno1.gaf>>
    OR
    ##vcfprocesslog=<inputvcf=<file1.vcf,file2.vcf,file3.vcf>, InputVCFSource=<varCaller1,varCaller2,varCaller3>, InputVCFVer=<1.0,2.1,2.0>, InputVCFParam=<a1,c2;a1,b1;a1,b1>, InputVCFgeneAnno=<anno1.gaf,anno2.gaf,anno3.gaf>, MergeSoftware=<sw1,sw2>, MergeParam=<a1,a2;b1,b2>, MergeVer=<2.1,3.0>, MergeContact=<johndoe@xyz.edu>>

INDIVIDUAL
  Description: Specifies the individual for which data is presented in the file
  Sample values: ##INDIVIDUAL=TCGA

INFO/FORMAT/FILTER meta-information

Format: ##FIELDTYPE=<key1=value1,key2=value2,...>

INFO, FORMAT and FILTER (case-sensitive values) are optional fields that have to be declared in the HEADER if they are referred to in the BODY of the file. The keys that can be used to define them are described in Table 3. The three fields do not all use the same set of keys. Please refer to individual field definitions for further details.
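
The two HEADER line shapes described above (##key=value and ##FIELDTYPE=<key1=value1,...>) can be told apart mechanically. The following is a minimal sketch, not part of the specification: the function names parse_meta_line and split_top_level are hypothetical, and the assumption that commas inside double quotes or nested angle brackets do not separate keys is inferred from the vcfprocesslog and Description examples.

```python
def split_top_level(text, sep=","):
    """Split on sep, ignoring separators inside double quotes or angle brackets."""
    parts, buf, depth, quoted = [], [], 0, False
    for ch in text:
        if ch == '"':
            quoted = not quoted
        elif not quoted and ch == "<":
            depth += 1
        elif not quoted and ch == ">":
            depth -= 1
        if ch == sep and not quoted and depth == 0:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts


def parse_meta_line(line):
    """Return (key, value) for ##key=value, or (FIELDTYPE, dict) for ##FIELDTYPE=<...>."""
    line = line.rstrip("\n")
    if not line.startswith("##"):
        raise ValueError("meta-information lines must start with '##'")
    key, eq, value = line[2:].partition("=")
    if not eq or not key:
        raise ValueError("expected ##key=value")
    if value.startswith("<") and value.endswith(">"):
        fields = {}
        for item in split_top_level(value[1:-1]):
            sub_key, sub_eq, sub_value = item.strip().partition("=")
            if not sub_eq:
                raise ValueError("expected key=value inside <...>: %r" % item)
            fields[sub_key] = sub_value  # quotes and nested <...> are kept verbatim
        return key, fields
    return key, value


print(parse_meta_line("##tcgaversion=1.0"))
print(parse_meta_line('##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">'))
```

Values are returned as uninterpreted strings, so checks such as required keys or allowed Type values can be layered on top without changing the parser.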

Table 3: Description of keys used in INFO/FORMAT/FILTER meta-information declarations
(Columns: Key, Case-sensitive, Description, Data type (possible values), Additional notes)

ID
  Description: Name of the field; also used in BODY of the file to assign values for individual variant records
  Data type: String, no whitespace, no comma

Number
  Description: Specifies the number of values that can be associated with the corresponding field
  Data type: Set (Integer >= 0, "A", "G", ".")
  Additional notes: Any integer >= 0 indicating number of values; "A", if the field has one value per alternate allele; "G", if the field has one value per genotype; ".", if number of values varies, is unknown, or is unbounded

Type
  Description: Indicates data type of the value associated with the field
  Data type: Set (Integer, Float, Flag, Character, String)
  Additional notes: "Flag" type indicates that the field does not contain a value entry, and hence the Number should be 0 in this case. FORMAT fields cannot have a "Flag" Type assigned to them.

Description
  Description: Provides a brief description of the field
  Data type: String, surrounded by double-quotes; cannot itself contain a double-quote; cannot contain trailing whitespace at the end of the string before the closing quotes

INFO lines

Format: ##INFO=<key1=value1,key2=value2,...>
Required keys: ID, Type, Number, Description

INFO fields are optional and contain additional annotations for a variant. Certain INFO fields have already been created and exist as reserved fields in the current VCF standard. Custom INFO fields can be added based on study requirements as long as they do not use the reserved field names. If an INFO field is declared in the header, it needs to be described further using the following format:

  ##INFO=<ID=ID,Number=number,Type=type,Description="description">
    ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

FORMAT lines

Format: ##FORMAT=<key1=value1,key2=value2,...>
Required keys: ID, Type, Number, Description

FORMAT declaration lines are optional and are used when annotations need to be added for individual genotypes associated with each sample in the file. FORMAT sub-fields are declared exactly like INFO sub-fields, with the exception that a FORMAT sub-field cannot be assigned a "Flag" Type.

  ##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

FILTER lines

Format: ##FILTER=<key1=value1,key2=value2,...>
Required keys: ID, Description

FILTER fields are defined to list filtering criteria used for generating variant calls. Custom filters can be applied as long as a definition is provided in the HEADER. FILTERs that have been applied to the data should be described as follows. Please note that FILTER declarations do not include Type or Number keys.

  ##FILTER=<ID=ID,Description="description">
    ##FILTER=<ID=q10,Description="Quality below 10">
    ##FILTER=<ID=s50,Description="Less than 50% of samples have data">

TCGA-specific meta-information

PEDIGREE lines

Format: ##PEDIGREE=<key1=value1,key2=value2,...>
Required keys: Name_0,..,Name_N where N >= 1

PEDIGREE lines are used to specify derivation relationships between different genomes. Name_0 is associated with the derived genome, and Name_1 through Name_N represent the genomes from which it is derived. In the case of tumor clonal populations, one population is clonally derived from another. In the example below, PRIMARY-TUMOR-GENOME is derived from GERMLINE-GENOME.

  ##PEDIGREE=<Name_0=<G0-ID>,Name_1=<G1-ID>,...,Name_N=<GN-ID>> where N is >= 1
    ##PEDIGREE=<Name_0=PRIMARY-TUMOR-GENOME,Name_1=GERMLINE-GENOME>

SAMPLE lines

Format: ##SAMPLE=<key1=value1,key2=value2,...>
Required keys: ID, Individual, File, Platform, Source, Accession

SAMPLE lines are used to include additional metadata about each sample for which data is represented in the VCF file. All samples are listed in the column header line following the FORMAT column (Figure 1). Each of these samples should have its own HEADER declaration, where the sample identifier in the column header is the same as the value assigned to the "ID" key in the corresponding declaration. The declaration lists information about the sample (source, platform, source file, etc.) and can also be used to indicate whether the sample is a mixture of different kinds of genomes. In the example below, the "Genomes", "Mixture" and "Genome_Description" tags represent a comma-separated list of the different genomes that a sample contains, the proportion of each genome in the sample, and a brief description of each genome, respectively.

  ##SAMPLE=<ID=id,SampleName=sampleName,Individual=individual,Description="description",File=bamfile,Platfo contamination","tumor genome">>

The "Description" field for genome mixture has been renamed to "Genome_Description" to distinguish it from the sample description. Values for tags related to genome mixture (Genomes, Mixture, Genome_Description) are within angle brackets.

Column header meta-information

Format: Tab-delimited line starting with "#" and containing headers for all columns in the BODY, as shown below. This is a mandatory header line where the first 8 fields are fixed and have to be defined in the column header. Columns from "FORMAT" onwards are optional and are included to encapsulate per-sample/genome genotype data.

  #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT <SAMPLE1 or GENOME1> <SAMPLE2 or GENOME2>...

BODY

Variant records

Data lines are tab-delimited and list information about individual variants and associated genotypes across samples. The first 8 fields (Figure 1) are required to be listed in the VCF column header line. Some of these fields require non-null values (see Table 4) for each record. For the remaining fixed fields, even if the field does not have an associated value, it still needs to be specified with a missing value identifier ("." in VCF 4.1). Subsequent fields are optional.

Table 4: Description of fields in the BODY of a VCF file
(Columns: Index, Field, Case-sensitive, Description, Data type (possible values), Sample values, Required*, Additional notes)

1 CHROM
  Description: Chromosome: an identifier from the reference genome or the assembly file defined in the HEADER.
  Data type: Alphanumeric string ([1-22], X, Y, MT, <ID>)
  Sample values: 20; <ctg1>
  Additional notes: Chromosome name should not contain the "chr" prefix, e.g., "chr10" will be an invalid entry

2 POS
  Description: Position: the reference position, with the 1st base having position 1.
  Data type: Non-negative integer

3 ID
  Description: Identifier: semi-colon separated list of unique identifiers if available.
  Data type: String, no white-space or semi-colons
  Sample values: rs

4 REF
  Description: Reference allele(s): reference allele at the position.
  Data type: String ([ACGTN]+)
  Sample values: GTCT
  Additional notes: Value in POS field refers to the position of the first base in the REF string.

5 ALT
  Description: Alternate allele(s): comma-separated list of alternate non-reference alleles called on at least one of the samples. An angle-bracketed ID string (<ID>) can also be used for symbolically representing alternate alleles.
  Data type: String; no whitespace, commas, or angle-brackets in the ID string ([ACGTN]+, <ID>, .)
  Sample values: G,GTCT; .; <INS:ME:ALU>
  Additional notes: If ALT==<ID>, ID needs to be defined in the header as ##ALT=<ID=Id,Description="description">

6 QUAL
  Description: Quality score: Phred-scaled quality score for the assertion made in ALT.
  Data type: Integer >= 0
  Sample values: 50
  Additional notes: Scores should be non-negative integers or missing values

7 FILTER
  Description: Filtering results: PASS if this position has passed all filters. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for the filters that fail.
  Data type: String, no whitespace or semi-colon
  Sample values: PASS; q10;s50
  Additional notes: "0" is reserved and cannot be used as a filter string.

8 INFO
  Description: Additional information: INFO fields are encoded as a semicolon-separated series of keys (same as ID in an INFO declaration) with optional values in the format <key=value>.
  Data type: String, no whitespace, semi-colons, or equal-signs
  Sample values: NS=3;DP=14;

9 FORMAT
  Description: Genotype sub-fields: if genotype data is present in the file, the fixed fields are followed by a FORMAT column. The field contains a colon-separated list of all pre-defined FORMAT sub-fields (same as ID in a FORMAT declaration) that are applicable to all samples that follow.
  Data type: String, no whitespace; sub-fields cannot contain a colon
  Sample values: GT:GQ:DP:HQ
  Additional notes: "GT" must be the first sub-field if it is present in the FORMAT field.

10 <SAMPLE>
  Case-sensitive: Case should be the same as in the "ID" tag of the ##SAMPLE declaration in the header
  Description: Per-sample genotype information: an arbitrary number of sample IDs can be added to the column header line, and a variant record in the BODY can contain genotype information corresponding to the FORMAT column for each sample. Contains a colon-separated list of values assigned to each of the sub-fields in the FORMAT column.
  Data type: String, no whitespace; sub-fields cannot contain a colon
  Sample values: 0|0:48:1:51,51
  Additional notes: Values are assigned to FORMAT sub-fields in the SAME order as specified in the "FORMAT" column. All samples in any given row for a variant record MUST contain values for all sub-fields as defined in the "FORMAT" column. If any of the fields does not have an associated value, then the missing value identifier (".") should be used for that field. However, "." cannot be used as a value for any of the IDs in the FORMAT field (e.g., GT:.:DP would lead to an error).

* A "Required" field cannot contain the missing value identifier for any record listed in data lines.

Extensions for TCGA data

TCGA data includes but is not limited to SNPs and small indels. A variant representation format for cancer data should be able to support more complex variation types such as structural variants, complex rearrangements and RNA-Seq variants. The following sub-sections present an overview of the extensions that have been added to clearly describe such variations in a VCF file.

Structural variants

A structural variant (SV) can be defined as a region of DNA that includes a variation in the structure of the chromosome. Such variations could be due to inversions and balanced translocations or genomic imbalances (insertions and deletions), also referred to as copy number variants (CNVs). Certain features have been added to the format in order to clearly describe structural variants in a VCF file. A detailed description of the extensions is available here.

Complex rearrangements

Chromosomal rearrangements are caused by breakage of DNA double helices at two different locations. The broken ends in turn rejoin to produce a new chromosomal arrangement. Complex rearrangements involving more than two breaks are frequently observed in cancer genomes. Certain modifications need to be made to the VCF standard to adequately represent such variations in a VCF file. A detailed specification of the proposed extensions to describe rearrangements in a VCF file is available here. Figure 2 illustrates some of the concepts relevant to VCF records for complex rearrangements.

Figure 2: Adjacencies and breakends in a chromosomal rearrangement (adapted from VCF 4.1 specification)

A VCF file has one line for each of the two breakends in an adjacency. Table 5 provides a list of sub-fields that have been added to describe breakends. An INFO sub-field (SVTYPE=BND) is used to indicate a breakend record. Sub-fields MATEID and PARID are used to represent variant record IDs of corresponding mates and partners respectively.

Table 5: Fields added for breakends
(Columns: Field:Sub-field, Description, Declaration in HEADER (sample values in BODY), Required)

INFO:SVTYPE
  Description: Type of structural variant; SVTYPE is set to "BND" for breakend records
  Declaration in HEADER: ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
  Sample value in BODY: SVTYPE=BND
  Required: SVTYPE=BND for breakend records

INFO:MATEID
  Description: ID of corresponding mate of the breakend record
  Declaration in HEADER: ##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakend">
  Sample value in BODY: MATEID=bnd_U

INFO:PARID
  Description: ID of corresponding partner of the breakend record
  Declaration in HEADER: ##INFO=<ID=PARID,Number=.,Type=String,Description="ID of partner breakend">
  Sample value in BODY: PARID=bnd_V

INFO:EVENT
  Description: ID of event associated to breakend
  Declaration in HEADER: ##INFO=<ID=EVENT,Number=.,Type=String,Description="ID of breakend event">
  Sample value in BODY: EVENT=RR0

The specification for the ALT field deviates from the standard format for breakend records. The ALT field for a breakend record can be represented in four possible ways based on the type of replacement.

  REF  ALT    Description
  s    t[p[   piece extending to the right of p is joined after t
  s    t]p]   reverse comp piece extending left of p is joined after t
  s    ]p]t   piece extending to the left of p is joined before t
  s    [p[t   reverse comp piece extending right of p is joined before t

Legend:
  s: sequence of REF bases beginning at position POS
  t: sequence of bases that replaces "s"
  p: position of the breakend mate indicating the first mapped base that joins at the adjacency; represented as a string of the form "chr:pos"
  []: square brackets indicate the direction that the joined sequence continues in, starting from p
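
The four breakend ALT forms above can be decoded with a single regular expression. This is an illustrative sketch only: the function name parse_breakend_alt and the returned dictionary keys are hypothetical, and the position in the last example call is made up for demonstration.

```python
import re

# t is either a run of bases or the missing-value identifier "."
_SEQ = r"(?:[ACGTN]+|\.)"
# Either t comes first (t[p[ or t]p]) or last (]p]t or [p[t); both brackets must match.
_BREAKEND = re.compile(
    rf"^(?:(?P<t1>{_SEQ})(?P<b1>[\[\]])(?P<p1>[^\[\]]+)(?P=b1)"
    rf"|(?P<b2>[\[\]])(?P<p2>[^\[\]]+)(?P=b2)(?P<t2>{_SEQ}))$"
)


def parse_breakend_alt(alt):
    """Decode a breakend ALT string into t, the mate position p, and the join type."""
    m = _BREAKEND.match(alt)
    if m is None:
        raise ValueError("not a breakend ALT string: %r" % alt)
    t_first = m.group("t1") is not None
    t = m.group("t1") if t_first else m.group("t2")
    bracket = m.group("b1") if t_first else m.group("b2")
    mate = m.group("p1") if t_first else m.group("p2")
    chrom, sep, pos = mate.rpartition(":")  # p has the form "chr:pos"
    if not sep or not chrom or not pos.isdigit():
        raise ValueError("mate position must look like chr:pos: %r" % mate)
    return {
        "t": t,
        "mate_chrom": chrom,
        "mate_pos": int(pos),
        # "[" means the joined piece extends to the right of p, "]" to the left
        "mate_piece_extends": "right" if bracket == "[" else "left",
        "joined": "after t" if t_first else "before t",
        # t]p] and [p[t are the reverse-complemented forms in the table
        "reverse_complemented": (bracket == "]") if t_first else (bracket == "["),
    }


print(parse_breakend_alt("G[17:198982["))
print(parse_breakend_alt("[13:123457[ACT"))  # hypothetical position
```

Matching the opening and closing bracket with a backreference rejects mixed forms such as G[17:198982], which none of the four rows in the table allows.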

RNA-Seq variants

VCF specifications have been extended to address expressed variants obtained from RNA-Seq. Features added for structural variants from genome/exome sequencing are applicable to RNA-Seq structural variants. However, RNA-Seq breakends are represented by setting SVTYPE=FND instead of BND (Table 6) since they can be different from those observed in DNA-Seq.

Table 6: Fields added for RNA-Seq variants
(Columns: Field:Sub-field, Description, Declaration in HEADER (sample values in BODY), Required)

INFO:SVTYPE
  Description: Type of structural variant; SVTYPE is set to "FND" for breakends associated with RNA-Seq
  Declaration in HEADER: ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
  Sample value in BODY: SVTYPE=FND
  Required: required for RNA-Seq breakend records; SVTYPE=FND

VCF files for RNA-Seq variants may include gene-related annotations. However, this is not a standard feature of VCF files, as eventually all VCF variants will be annotated using information in the Generic Annotation File (GAF). Additional INFO and FORMAT sub-fields have been included to describe the characteristics of expressed nucleotide variants (Table 6a).

Table 6a: Annotation fields added for RNA-Seq variants
(Columns: Field:Sub-field, Description, Declaration in HEADER (sample values in BODY), Required)

INFO:SID
  Description: Unique identifiers from the gene annotation source as specified in ##geneanno; "unknown" should be used if the identifier is not known; a comma-separated list of IDs can be used if the variant overlaps with multiple features
  Declaration in HEADER: ##INFO=<ID=SID,Number=.,Type=String,Description="Unique identifier from gene annotation source or unknown">
  Sample value in BODY: SID=13,198

INFO:GENE
  Description: HUGO gene symbol; "unknown" should be used when the gene symbol is unknown; a comma-separated list of genes can be used if the variant overlaps with multiple transcripts/genes
  Declaration in HEADER: ##INFO=<ID=GENE,Number=.,Type=String,Description="HUGO gene symbol">
  Sample value in BODY: GENE=ERBB2,ERBB2

INFO:RGN
  Description: Region where a nucleotide variant occurs in relation to a gene
  Declaration in HEADER: ##INFO=<ID=RGN,Number=.,Type=String,Description="Region where nucleotide variant occurs in relation to a gene">
  Sample value in BODY: RGN=exon,3_utr

INFO:RE
  Description: Flag to indicate if position is known to have RNA-edits occur
  Declaration in HEADER: ##INFO=<ID=RE,Number=0,Type=Flag,Description="Position known to have RNA-edits to occur">
  Sample value in BODY: RE

FORMAT:TE
  Description: Translational effect of a nucleotide variant in a codon
  Declaration in HEADER: ##FORMAT=<ID=TE,Number=.,Type=String,Description="Translational effect of the variant in a codon">
  Sample value in BODY: MIS,NA

Validation rules

At a minimum, every file needs to go through the checks listed below. The following example of a VCF file shows certain violations cited in the listed validation steps. Please note that line numbers in the file segment below are added for illustration purposes alone and are not expected to be found in an actual VCF file.

  Line1 ##fileformat=vcfv4.1
  Line2 ##filedate=
  Line3 ##source=myimputationprogramv3.1
  Line4 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta
  Line5 ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
  Line6 ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
  Line7 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
  Line8 ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
  Line9 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
  Line10 ##FORMAT=<ID=PL,Number=3,Type=Integer,Description=" Normalized Phred-scaled likelihoods for AA, AB, BB genotypes ">
  Line11 ##FILTER=<ID=q10,Description="Quality below 10">
  Line12 ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
  Line13 FILTER=<ID=c10,Description="Shallow coverage below 10x">
  Line14 ##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
  Line15 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCGA TCGA
  Line16 var1 G A 29 q10 NS=2;DP=14 GT:GQ:DP 0|0:48 0|1:48:3
  Line17 var2 G A 35 q10;s50 NS=2.5 GQ:GT 48:0|0 51:0|1
  Line18 var3 C T 30 q10;s10 NS=2 GT:GQ:DP 0/2:48:3 0/1:51:4
  Line19 rs123 C <DEL:ME:ALU> 12 PASS NS=3;DB GT:GQ 0/1:50 1/1:40
  Line20 A <DUP> 20 PASS NS=3 GT:GQ:PL 0/1:49:42,3 1/1:38:96,47/70
  Line21 rs456 T C 15 PASS NS=3/DB GT 0/1 1/1

Important: A file will be validated as a TCGA VCF file only if it contains the ##tcgaversion HEADER line (e.g., ##tcgaversion=1.0). The current acceptable version is
Mandatory header lines should be present.
All meta-information header lines should be prefixed with "##".
The column header line should be prefixed with "#". A VCF file can contain only a single column header line, which must contain all required field names.
Any line lacking the "##" or "#" prefix will be assumed to be a BODY data line and will have to follow the specified format. For example, Line13 leads to a violation as it lacks "##" or "#" but is not a tab-delimited row containing variant information.
HEADER lines cannot be present within the BODY of a file and vice-versa.
INFO, FORMAT and FILTER declarations should follow the format below, where all keys are required but the order of keys is irrelevant.
  ##INFO=<ID=id,Number=number,Type=type,Description="description">
  ##FORMAT=<ID=id,Number=number,Type=type,Description="description">
  ##FILTER=<ID=id,Description="description">

7. Values assigned to ID, Number, Type and Description in INFO, FORMAT or FILTER declarations should follow the rules listed below. A detailed description of the declaration format is provided here.
   a. ID, Number, Type !~ /(\s|,|=|;)/
   b. Number is in {Integer>=0, "A", "G", "."}
   c. Type is in {Integer, String, Float, Flag, Character}
   d. Description should be within double quotes and cannot itself contain a double quote
   e. Description string cannot contain leading or trailing whitespace after the opening or before the closing quotation marks; Line10 shows a violation as the Description string contains leading and trailing whitespace.
   f. For FORMAT declarations, Type != "Flag"

8. Any INFO, FORMAT or FILTER sub-fields used in the BODY are required to be defined in the HEADER. For example, var1 (Line16) shows a violation as read depth "DP" is assigned a value (DP=14) without being defined as an INFO sub-field in the HEADER.

9. Validation of INFO sub-fields:
   a. An INFO sub-field should be included for a variant record in the BODY as <key=value> (e.g., NS=2), where key is the "ID" value of the sub-field in the HEADER declaration. Exception: An INFO field of "Flag" Type will not be assigned a value in the BODY. The presence of a flag in the INFO column merely indicates that the variant record satisfies a condition associated with the flag. For example, Line19 has a "DB" flag without a value entry in the INFO column. "DB" in the INFO column indicates that the variant exists in dbsnp.
   b. Multiple INFO sub-fields can be associated with a single variant record using ";" as a separator (e.g., Line16). Line21 has a violation as "/" is used as a separator in the INFO column.

10. Validation of FORMAT sub-fields:
   a. The FORMAT column for a variant record contains a colon-separated list of all pre-defined FORMAT sub-fields (identified by the "ID" value in the HEADER declaration) that are applicable to all samples that follow. A ":" is the only valid separator for sub-fields.

   b. The number of colon-separated sub-fields in the FORMAT column should equal the number of colon-separated values assigned to each sample. For example, var1 (Line16) violates this rule for the sample TCGA as there are 3 sub-fields in the FORMAT column but only 2 values in the sample column.
   c. If the "GT" sub-field is defined in the FORMAT field for a variant, it must be the first sub-field in the string. For example, var2 (Line17) violates this rule as GT is not the first sub-field even though it is present in the FORMAT field.
      i. GT is not a required sub-field and can be omitted for a variant row if none of the samples have genotype calls available.
      ii. GT represents the genotype, encoded as allele values separated by either / (genotype unphased) or | (genotype phased). The allele values are 0 for the reference allele (in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. Examples: 0/1, 1|0, or 1/2, etc.
      iii. GT is assigned only one allele value for haploid calls (e.g., on the Y chromosome). Therefore, if CHROM=="Y" then GT should have only one allele value assigned to it (e.g., "1", "0", ".", etc.) instead of two alleles (e.g., "1/1", "0|0"). If CHROM=="MT" then there is no constraint on the number of alleles as long as the number is bounded within the alleles listed in REF and/or ALT (e.g., 0/1, 0/1/2, 1 are all valid values for MT if REF and ALT have one and two allele values respectively).
      iv. If GT is present in the "FORMAT" column, then all samples should have values assigned to the field for that variant. If an allele cannot be called for a sample at a given locus, "." will be specified for each missing allele in the GT field (for example "./." for a diploid genotype and "." for a haploid genotype).
      v. Validation should include ensuring that the allele number in GT is within the range of alleles specified in ALT and REF.
         For example, var3 (Line18) violates this rule as it lists GT as "0/2" for sample TCGA but ALT contains only one allele, so the only acceptable allele numbers are 0 (REF) and 1 (ALT).

11. If an INFO or FORMAT sub-field is declared in the header AND is assigned a value for a variant record in the body, the data type should be consistent with the expected type defined in the Type key of the corresponding declaration. For example, var2 (Line17) violates this rule as the definition for the "NS" INFO sub-field states that the data type is integer whereas the variant record contains a float value (2.5) assigned to the sub-field. Exception: The rule does not apply if the Type of a field is not defined or is incorrectly defined (e.g., field not declared in HEADER, Type not included in declaration, incorrect value for Type). It also does not apply to any missing values (denoted with ".") in the record as they do not have an associated data type.

12. Multiple comma-separated values (corresponding to the value assigned to the Number key in the declaration) can be specified for an INFO or FORMAT sub-field for a variant record. No other character can be used as a separator. Line20 shows a violation as a "/" is used as a separator between the 2nd and 3rd values for the "PL" FORMAT sub-field in the second sample column.

13. If the Number tag is assigned a known bounded value (an integer, "A", "G") for an INFO/FORMAT sub-field, it should be consistent with the number of values specified for any variant record in the BODY of the file. For example, Line20 shows a violation as "PL" is associated with 3 integer values (Line10) but the variant record has only 2 comma-separated integer values (42,3) for sample TCGA.

14. Validation of FILTER sub-fields:
   a. Valid values for the FILTER column are "PASS" or a code for the filter that the variant call fails (e.g., "q10" in Line16). The code must correspond to the "ID" value of the corresponding FILTER declaration.
   b. If a call fails multiple filters, the FILTER column should contain a semicolon-separated list of all failed filter codes (e.g., "q10;s50" in Line17). A ";" is the only valid separator.
   c. All codes listed in the FILTER column must have a well-formed declaration in the HEADER. Line18 shows a violation as "s10" does not have an associated definition in the HEADER.

15. <TCGA-VCF> Validation of SAMPLE meta-information lines:
   a. Each sample ID in the column header (immediately after the FORMAT column) must have an associated HEADER declaration where the value assigned to the "ID" tag in the declaration is the same as the sample ID used in the column name.
   b. The declaration must contain all required fields.
   c. Genome mixture tags (Genomes, Mixture, Genome_Description) are enclosed within angle brackets (<>) and can have multiple comma-separated values.
   d. If more than one of the genome mixture tags (Genomes, Mixture, Genome_Description) is defined in a SAMPLE meta-information line, then the number of comma-separated values should be the same for all defined tags. For example, "Genomes=<G1,G2>,Mixture=<0.1,0.8,0.1>" would lead to a violation as Mixture has 3 values while Genomes has only 2 values.
   e. Individual values in "Genomes" are strings without white-space, comma or angle brackets.
   f. Individual values in "Mixture" represent the proportion (floating point number >= 0 and <= 1) of each genome in the sample, and all comma-separated values should add up to a sum of 1.
   g. Individual values in "Genome_Description" are strings surrounded by double quotes where the string itself cannot contain a double quote.

16. <TCGA-VCF> Validation of PEDIGREE meta-information lines:
   a. The declaration line should follow the format ##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID> where:
      i. N >= 1
      ii. Name_0 through Name_N are strings that cannot contain white-space, comma, or angle brackets.
      iii. G0-ID through GN-ID are strings that cannot contain white-space, comma, or angle brackets.
Each of these should be a header for the genotype columns immediately after FORMAT column and should be defined using "ID" tag in the corresponding ##SAMPLE meta-information line. Validation of custom meta-information fields: a. If a user-created custom meta-information declaration is encountered and the corresponding key/value structure and content have not been defined in this specification, the line should be validated to ensure it follows one of the following two formats:

13 ##key=value ##<INDIVIDUAL=TCGA > OR ##FIELDTYPE=<key1=value1,key2=value2,...> ##contig=<id=20,length= ,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="h sapiens",taxonomy=x> where: 18. i. ii. key!~ /(\s, = ;)/ value!~ /(\s, = ;)/ UNLESS value is within double quotes, in which case it cannot itself contain a double quote or leading/trailing whitespace OR if value is within angle brackets. CHROM, POS, and REF are required fields and cannot contain missing value identifiers. Please refer to Table 4 for acceptable values. a. <TCGA-VCF> CHROM is in {[1-22], X, Y, MT,<chr_ID>} where chr_id cannot contain whitespace or <> b. If CHROM == <chr_id> then the VCF file MUST have a declaration for assembly file in the HEADER in the following format: ##assembly=url ##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fas 19. c. POS is a non-negative integer d. REF =~ /[ACGTN]+/ <TCGA-VCF> ALT is in {[ACGTN]+, ".", <ID>, SV_ALT }; a. String SV_ALT can be in one of the following four formats and can be used in the ALT field ONLY when the corresponding INFO field has the key-value pair "SVTYPE=BND" or "SVTYPE=FND". Format seq[chr:pos[ seq]chr:pos] ]chr:pos]seq [chr:pos[seq Example G[17:198982[ GC]1:238909] ]<ctg1>:235788]gcna [1: [ACT 20. i. seq is in {[ACGTN]+, "."} ii. chr is in {<chr_id>, [1-22], X, Y, MT} where chr_id is a string iii. pos is a non-negative integer b. Similar to 18b, if chr == <chr_id> (where chr_id is a string) then the VCF file must have an ##assembly declaration in the HEADER. c. If ALT is assigned a value in <ID> format, (e.g., rs123 in Line19), <ID> should be defined in the HEADER as ##ALT=<ID= ID,Description=" Description"> (Line14) where ID cannot contain white-space or angle brackets. Line20 shows a violation of this rule as ALT== <DUP> but there is no corresponding ALT declaration in the HEADER with <ID=DUP>. d. ALT can contain multiple comma-separated values. other character can be used as a separator. 
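The ALT rule above (rule 19) can be sketched as a small pattern check. This is a minimal illustration rather than the DCC validator: the helper names `is_sv_alt` and `check_alt` are hypothetical, the `chr_ID` pattern borrows the "no whitespace or angle brackets" constraint from rule 18a, and matching is case-sensitive because the rule is written as [ACGTN]+.

```python
import re
from typing import List, Optional

# Building blocks from rule 19a. All names here are illustrative assumptions.
SEQ = r"(?:[ACGTN]+|\.)"
CHR = r"(?:<[^\s<>]+>|[1-9]|1[0-9]|2[0-2]|X|Y|MT)"
POS = r"[0-9]+"

SV_ALT_PATTERNS = [
    re.compile(rf"{SEQ}\[{CHR}:{POS}\["),  # seq[chr:pos[
    re.compile(rf"{SEQ}\]{CHR}:{POS}\]"),  # seq]chr:pos]
    re.compile(rf"\]{CHR}:{POS}\]{SEQ}"),  # ]chr:pos]seq
    re.compile(rf"\[{CHR}:{POS}\[{SEQ}"),  # [chr:pos[seq
]
PLAIN_ALT = re.compile(r"[ACGTN]+|\.")   # sequence or missing value
ID_ALT = re.compile(r"<[^\s<>]+>")       # <ID>, to be declared via ##ALT


def is_sv_alt(allele: str) -> bool:
    """Return True if the allele matches one of the four SV_ALT formats."""
    return any(p.fullmatch(allele) for p in SV_ALT_PATTERNS)


def check_alt(alt_field: str, svtype: Optional[str]) -> List[str]:
    """Return a list of violations for one ALT column value.

    svtype is the value of the INFO key SVTYPE for the record, or None.
    Declaration lookups (##ALT, ##assembly) are out of scope for this sketch.
    """
    errors = []
    for allele in alt_field.split(","):  # comma is the only valid separator
        if PLAIN_ALT.fullmatch(allele) or ID_ALT.fullmatch(allele):
            continue
        if is_sv_alt(allele):
            if svtype not in ("BND", "FND"):
                errors.append(f"{allele}: SV_ALT requires SVTYPE=BND or SVTYPE=FND")
            continue
        errors.append(f"{allele}: not a valid ALT value")
    return errors
```

For instance, `check_alt("G[17:198982[", "BND")` returns an empty list, while the same allele without an SVTYPE, or a value such as `A/T` that uses a separator other than a comma, yields one violation message.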
20. No two records are allowed to have the same ID value. Two records can, however, have the same CHROM and POS values.
Exception: Multiple records in a file are allowed to have the same missing value identifier (".") as ID.

21. QUAL field can only contain non-negative integers or "." (missing value).

22. <TCGA-VCF> If INFO sub-field "VT" is declared and used in the BODY, its value can only be in {SNP, INS, DEL}

23. <TCGA-VCF> If FORMAT sub-field "SS" is declared and used in the BODY, its value can be 0, 1, 2, 3, or 4 depending on whether, relative to normal, the variant is wildtype, germline, somatic, LOH, or unknown, respectively.

24. <TCGA-VCF> "DP" sub-field for read depth can be defined in the INFO (combined depth across all samples) or FORMAT (depth in a specific sample) field. If both INFO and FORMAT have values for the sub-field, then the sum of DP values across all FORMAT sample columns should be equal to the DP value in the INFO field.

25. <TCGA-VCF> Validation of complex rearrangement records:
a. If the INFO field includes key-value pairs "SVTYPE=BND" or "SVTYPE=FND" and has values for "MATEID" and/or "PARID", then the value (or multiple comma-separated values) assigned to MATEID or PARID should exist in the file as the "ID" field for another variant record.

26. <TCGA-VCF> Validation of RNA-Seq annotation fields:
a. If the INFO field includes "SID", "GENE" or "RGN" keys with associated values, then the file MUST contain a declaration for ##geneanno in the HEADER.
b. Number of comma-separated values in the optional INFO sub-fields "SID", "GENE" and "RGN" and the FORMAT sub-field "TE" must be the same if more than one of these sub-fields are defined for a record.
c. INFO sub-field "RGN" is in {5_utr, 3_utr, exon, intron, ncds, sp}.
d. FORMAT sub-field "TE" is in {SIL, MIS, NSNS, NSTP, FSH, NA}
e. If "RGN" and "TE" have the same number of comma-separated values, then "RGN" must be "exon" for "TE" to have any value other than "NA". For example, if "RGN=exon,intron,intron" then having "MIS,SIL,NA" for TE would lead to a violation as the 2nd value for RGN is "intron" but the corresponding TE value is "SIL" instead of "NA".

27. <TCGA-VCF> Validation of vcfprocesslog tags:
##vcfprocesslog=<inputvcf=<file1.vcf>,inputvcfsource=<varcaller1>,inputvcfver=<1.0>,inputvcfparam=
a. Individual values for each tag are enclosed within angle brackets (<>) instead of double quotes.
b. If a field contains multiple values, they are separated by comma.
Exception: Separator for multiple values in InputVCFParam and MergeParam is a ";" instead of ",". Individual values within these tags can contain comma-separated parameters (e.g., <a1,c2;a1,b1;a1,b1 in the example given above).
c. If the InputVCF tag has multiple comma-separated values assigned to it (please refer to the second example above), then InputVCFSource, InputVCFVer, InputVCFParam, and InputVCFgeneAnno must contain the same number of values. If a value is not known, it should be substituted with the missing value identifier (".").
d. If InputVCF contains only a single value, then all tags =~ /Merge.*/ are optional and can either be omitted or can contain the missing value identifier ("."). The reason is that attributes related to merging VCF files are applicable only if multiple input VCF files are being merged.
e. If MergeSoftware contains multiple comma-separated values, MergeParam and MergeVer should contain the same number of values. There is no such constraint for MergeContact.

Handling failed checks
A VCF file would be required to pass ALL the checks listed above and any violation will lead to a "Failed" validation. Even if a failure is encountered, the file would still need to go through all other checks for validation to be complete.
Exception to this requirement would include cases where execution of one validation check is dependent on the success of another prerequisite step. For example, number of values associated with a FORMAT field for a variant record cannot be validated if the field itself is not declared in the HEADER or has a missing Number tag. A summary of all failed checks should be provided as an output.

Test files
The following table lists sample files that can be used to test various validation steps. Parent directory is

Table 7: Test files known to pass/fail validation steps

Expected result: Success
  Validation file with no failures
  Test file: /vcfformat/tcga _illuminaga-dnaseq_exome.format2.vcf

Expected result: Failure
  "chr" prefix for chromosome names
  missing values for FORMAT fields
  Test file: /BI/ /BCM-GBM_solid.TCGA mut.vcf

Expected result: Failure
  column header line has no "#" as beginning
  trailing whitespace at the end of description strings in metadata
  filter not listed in FILTER metadata line
  in multi-allelic sites, different alternative alleles in the ALT field are separated by "/" instead of ","
  Test file: /genome.wustl.edu/ /tcga a-01w vcf.gz

Expected result: Failure
  double quotes missing from SAMPLE metadata lines Description
  Test file: /BCM/TCGA _IlluminaGA-DNASeq_exome.vcf

Expected result: Failure
  scores in QUAL column are negative integers
  Test file: /UCSC/ /TCGA _W_capture.vcf
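The policy in "Handling failed checks" (run every check, skip only those whose prerequisite failed, and report a summary) can be sketched as follows. This is an illustrative outline under stated assumptions, not the DCC implementation: a record is simplified to a flat dictionary of raw strings, the check names and the "Passed" status label are hypothetical, and only rule 21 and parts of rule 26 are shown.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]  # hypothetical: field name -> raw string value
# (check name, check function, name of prerequisite check or None)
Check = Tuple[str, Callable[[Record], List[str]], Optional[str]]


def check_qual(rec: Record) -> List[str]:
    """Rule 21: QUAL is a non-negative integer or '.'."""
    qual = rec.get("QUAL", ".")
    if qual == "." or re.fullmatch(r"[0-9]+", qual):
        return []
    return [f"QUAL={qual!r} is not a non-negative integer or '.'"]


def check_rgn_te_count(rec: Record) -> List[str]:
    """Rule 26b, restricted to RGN and TE: equal number of values."""
    if "RGN" not in rec or "TE" not in rec:
        return []
    n_rgn, n_te = len(rec["RGN"].split(",")), len(rec["TE"].split(","))
    return [] if n_rgn == n_te else [f"RGN has {n_rgn} values, TE has {n_te}"]


def check_rgn_te_pairing(rec: Record) -> List[str]:
    """Rule 26e: TE may differ from 'NA' only where RGN is 'exon'."""
    if "RGN" not in rec or "TE" not in rec:
        return []
    pairs = zip(rec["RGN"].split(","), rec["TE"].split(","))
    return [f"value {i}: RGN={rgn} but TE={te} (expected NA)"
            for i, (rgn, te) in enumerate(pairs, start=1)
            if rgn != "exon" and te != "NA"]


CHECKS: List[Check] = [
    ("rule21", check_qual, None),
    ("rule26b", check_rgn_te_count, None),
    ("rule26e", check_rgn_te_pairing, "rule26b"),  # needs equal counts first
]


def run_checks(rec: Record, checks: List[Check] = CHECKS):
    """Run all checks; never stop at the first failure."""
    failed: Dict[str, List[str]] = {}
    skipped: List[str] = []
    for name, func, prereq in checks:
        if prereq is not None and (prereq in failed or prereq in skipped):
            skipped.append(name)  # dependent check cannot be evaluated
            continue
        errors = func(rec)
        if errors:
            failed[name] = errors
    status = "Failed" if failed else "Passed"
    return status, failed, skipped
```

With the example from rule 26e, `run_checks({"QUAL": "-5", "RGN": "exon,intron,intron", "TE": "MIS,SIL,NA"})` reports both the QUAL and the RGN/TE violations in one pass instead of stopping at the first one.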

Briefly: Bioinformatics File Formats. J Fass September 2018

Briefly: Bioinformatics File Formats. J Fass September 2018 Briefly: Bioinformatics File Formats J Fass September 2018 Overview ASCII Text Sequence Fasta, Fastq ~Annotation TSV, CSV, BED, GFF, GTF, VCF, SAM Binary (Data, Compressed, Executable) Data HDF5 BAM /

More information

From fastq to vcf. NGG 2016 / Evolutionary Genomics Ari Löytynoja /

From fastq to vcf. NGG 2016 / Evolutionary Genomics Ari Löytynoja / From fastq to vcf Overview of resequencing analysis samples fastq fastq fastq fastq mapping bam bam bam bam variant calling samples 18917 C A 0/0 0/0 0/0 0/0 18969 G T 0/0 0/0 0/0 0/0 19022 G T 0/1 1/1

More information

The Variant Call Format (VCF) Version 4.2 Specification

The Variant Call Format (VCF) Version 4.2 Specification The Variant Call Format (VCF) Version 4.2 Specification 17 Dec 2013 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version c02ad4c from that

More information

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection Computational Biology Service Unit (CBSU) Cornell Center for Comparative and Population Genomics (3CPG) Center for

More information

NGS Data Analysis. Roberto Preste

NGS Data Analysis. Roberto Preste NGS Data Analysis Roberto Preste 1 Useful info http://bit.ly/2r1y2dr Contacts: roberto.preste@gmail.com Slides: http://bit.ly/ngs-data 2 NGS data analysis Overview 3 NGS Data Analysis: the basic idea http://bit.ly/2r1y2dr

More information

Analysing re-sequencing samples. Malin Larsson WABI / SciLifeLab

Analysing re-sequencing samples. Malin Larsson WABI / SciLifeLab Analysing re-sequencing samples Malin Larsson Malin.larsson@scilifelab.se WABI / SciLifeLab Re-sequencing Reference genome assembly...gtgcgtagactgctagatcgaaga...! Re-sequencing IND 1! GTAGACT! AGATCGG!

More information

Analysing re-sequencing samples. Anna Johansson WABI / SciLifeLab

Analysing re-sequencing samples. Anna Johansson WABI / SciLifeLab Analysing re-sequencing samples Anna Johansson Anna.johansson@scilifelab.se WABI / SciLifeLab Re-sequencing Reference genome assembly...gtgcgtagactgctagatcgaaga... Re-sequencing IND 1 GTAGACT AGATCGG GCGTAGT

More information

INTRODUCTION AUX FORMATS DE FICHIERS

INTRODUCTION AUX FORMATS DE FICHIERS INTRODUCTION AUX FORMATS DE FICHIERS Plan. Formats de séquences brutes.. Format fasta.2. Format fastq 2. Formats d alignements 2.. Format SAM 2.2. Format BAM 4. Format «Variant Calling» 4.. Format Varscan

More information

MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September

MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September 27 2014 Static Dynamic Static Minimum Information for Reporting

More information

Intro to NGS Tutorial

Intro to NGS Tutorial Intro to NGS Tutorial Release 8.6.0 Golden Helix, Inc. October 31, 2016 Contents 1. Overview 2 2. Import Variants and Quality Fields 3 3. Quality Filters 10 Generate Alternate Read Ratio.........................................

More information

SAM and VCF formats. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

SAM and VCF formats. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016 SAM and VCF formats UCD Genome Center Bioinformatics Core Tuesday 14 June 2016 File Format: SAM / BAM / CRAM! NEW http://samtools.sourceforge.net/ - deprecated! http://www.htslib.org/ - SAMtools 1.0 and

More information

Data Walkthrough: Background

Data Walkthrough: Background Data Walkthrough: Background File Types FASTA Files FASTA files are text-based representations of genetic information. They can contain nucleotide or amino acid sequences. For this activity, students will

More information

BaseSpace Variant Interpreter Release Notes

BaseSpace Variant Interpreter Release Notes Document ID: EHAD_RN_010220118_0 Release Notes External v.2.4.1 (KN:v1.2.24) Release Date: Page 1 of 7 BaseSpace Variant Interpreter Release Notes BaseSpace Variant Interpreter v2.4.1 FOR RESEARCH USE

More information

Introduction to GDS. Stephanie Gogarten. August 7, 2017

Introduction to GDS. Stephanie Gogarten. August 7, 2017 Introduction to GDS Stephanie Gogarten August 7, 2017 Genomic Data Structure Author: Xiuwen Zheng CoreArray (C++ library) designed for large-scale data management of genome-wide variants data format (GDS)

More information

Assignment 7: Single-cell genomics. Bio /02/2018

Assignment 7: Single-cell genomics. Bio /02/2018 Assignment 7: Single-cell genomics Bio5488 03/02/2018 Assignment 7: Single-cell genomics Input Genotypes called from several exome-sequencing datasets derived from either bulk or small pools of cells (VCF

More information

Importing Next Generation Sequencing Data in LOVD 3.0

Importing Next Generation Sequencing Data in LOVD 3.0 Importing Next Generation Sequencing Data in LOVD 3.0 J. Hoogenboom April 17 th, 2012 Abstract Genetic variations that are discovered in mutation screenings can be stored in Locus Specific Databases (LSDBs).

More information

Introduction to GDS. Stephanie Gogarten. July 18, 2018

Introduction to GDS. Stephanie Gogarten. July 18, 2018 Introduction to GDS Stephanie Gogarten July 18, 2018 Genomic Data Structure CoreArray (C++ library) designed for large-scale data management of genome-wide variants data format (GDS) to store multiple

More information

SweeD 3.0. Pavlos Pavlidis & Nikolaos Alachiotis

SweeD 3.0. Pavlos Pavlidis & Nikolaos Alachiotis 1 SweeD 3.0 Pavlos Pavlidis & Nikolaos Alachiotis Contents 1 Introduction 1 2 The Site Frequency Spectrum (SFS) pattern of selective sweeps 3 2.1 The selective sweep model as implemented by Nielsen et

More information

Allele Registry. versions 1.01.xx. API specification. Table of Contents. document version 1

Allele Registry. versions 1.01.xx. API specification. Table of Contents. document version 1 Allele Registry versions 1.01.xx API specification document version 1 Table of Contents Introduction...3 Sending HTTP requests...3 Bash...3 Ruby...4 Python...5 Authentication...5 Error responses...6 Parameter

More information

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010 Next Generation Sequence Alignment on the BRC Cluster Steve Newhouse 22 July 2010 Overview Practical guide to processing next generation sequencing data on the cluster No details on the inner workings

More information

RVD2.7 command line program (CLI) instructions

RVD2.7 command line program (CLI) instructions RVD2.7 command line program (CLI) instructions Contents I. The overall Flowchart of RVD2 program... 1 II. The overall Flow chart of Test s... 2 III. RVD2 CLI syntax... 3 IV. RVD2 CLI demo... 5 I. The overall

More information

Axiom Analysis Suite Release Notes (For research use only. Not for use in diagnostic procedures.)

Axiom Analysis Suite Release Notes (For research use only. Not for use in diagnostic procedures.) Axiom Analysis Suite 4.0.1 Release Notes (For research use only. Not for use in diagnostic procedures.) Axiom Analysis Suite 4.0.1 includes the following changes/updates: 1. For library packages that support

More information

SOLiD GFF File Format

SOLiD GFF File Format SOLiD GFF File Format 1 Introduction The GFF file is a text based repository and contains data and analysis results; colorspace calls, quality values (QV) and variant annotations. The inputs to the GFF

More information

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM). Release Notes Agilent SureCall 4.0 Product Number G4980AA SureCall Client 6-month named license supports installation of one client and server (to host the SureCall database) on one machine. For additional

More information

GenomeStudio Software Release Notes

GenomeStudio Software Release Notes GenomeStudio Software 2009.2 Release Notes 1. GenomeStudio Software 2009.2 Framework... 1 2. Illumina Genome Viewer v1.5...2 3. Genotyping Module v1.5... 4 4. Gene Expression Module v1.5... 6 5. Methylation

More information

fasta2genotype.py Version 1.10 Written for Python Available on request from the author 2017 Paul Maier

fasta2genotype.py Version 1.10 Written for Python Available on request from the author 2017 Paul Maier 1 fasta2genotype.py Version 1.10 Written for Python 2.7.10 Available on request from the author 2017 Paul Maier This program takes a fasta file listing all sequence haplotypes of all individuals at all

More information

SEQGWAS: Integrative Analysis of SEQuencing and GWAS Data

SEQGWAS: Integrative Analysis of SEQuencing and GWAS Data SEQGWAS: Integrative Analysis of SEQuencing and GWAS Data SYNOPSIS SEQGWAS [--sfile] [--chr] OPTIONS Option Default Description --sfile specification.txt Select a specification file --chr Select a chromosome

More information

4.1. Access the internet and log on to the UCSC Genome Bioinformatics Web Page (Figure 1-

4.1. Access the internet and log on to the UCSC Genome Bioinformatics Web Page (Figure 1- 1. PURPOSE To provide instructions for finding rs Numbers (SNP database ID numbers) and increasing sequence length by utilizing the UCSC Genome Bioinformatics Database. 2. MATERIALS 2.1. Sequence Information

More information

Click on "+" button Select your VCF data files (see #Input Formats->1 above) Remove file from files list:

Click on + button Select your VCF data files (see #Input Formats->1 above) Remove file from files list: CircosVCF: CircosVCF is a web based visualization tool of genome-wide variant data described in VCF files using circos plots. The provided visualization capabilities, gives a broad overview of the genomic

More information

UCSC Genome Browser ASHG 2014 Workshop

UCSC Genome Browser ASHG 2014 Workshop UCSC Genome Browser ASHG 2014 Workshop We will be using human assembly hg19. Some steps may seem a bit cryptic or truncated. That is by design, so you will think about things as you go. In this document,

More information

Isaac Enrichment v2.0 App

Isaac Enrichment v2.0 App Isaac Enrichment v2.0 App Introduction 3 Running Isaac Enrichment v2.0 5 Isaac Enrichment v2.0 Output 7 Isaac Enrichment v2.0 Methods 31 Technical Assistance ILLUMINA PROPRIETARY 15050960 Rev. C December

More information

Association Analysis of Sequence Data using PLINK/SEQ (PSEQ)

Association Analysis of Sequence Data using PLINK/SEQ (PSEQ) Association Analysis of Sequence Data using PLINK/SEQ (PSEQ) Copyright (c) 2018 Stanley Hooker, Biao Li, Di Zhang and Suzanne M. Leal Purpose PLINK/SEQ (PSEQ) is an open-source C/C++ library for working

More information

Universal Format Plug-in User s Guide. Version 10g Release 3 (10.3)

Universal Format Plug-in User s Guide. Version 10g Release 3 (10.3) Universal Format Plug-in User s Guide Version 10g Release 3 (10.3) UNIVERSAL... 3 TERMINOLOGY... 3 CREATING A UNIVERSAL FORMAT... 5 CREATING A UNIVERSAL FORMAT BASED ON AN EXISTING UNIVERSAL FORMAT...

More information

MAGA: Meta-Analysis of Gene-level Associations

MAGA: Meta-Analysis of Gene-level Associations MAGA: Meta-Analysis of Gene-level Associations SYNOPSIS MAGA [--sfile] [--chr] OPTIONS Option Default Description --sfile specification.txt Select a specification file --chr Select a chromosome DESCRIPTION

More information

Read Naming Format Specification

Read Naming Format Specification Read Naming Format Specification Karel Břinda Valentina Boeva Gregory Kucherov Version 0.1.3 (4 August 2015) Abstract This document provides a standard for naming simulated Next-Generation Sequencing (Ngs)

More information

Importing and Merging Data Tutorial

Importing and Merging Data Tutorial Importing and Merging Data Tutorial Release 1.0 Golden Helix, Inc. February 17, 2012 Contents 1. Overview 2 2. Import Pedigree Data 4 3. Import Phenotypic Data 6 4. Import Genetic Data 8 5. Import and

More information

Standard Sequencing Service Data File Formats

Standard Sequencing Service Data File Formats Standard Sequencing Service Data File Formats File format v2.0 Software v2.0 March 2012 CGA Tools, cpal, and DNB are trademarks of Complete Genomics, Inc. in the US and certain other countries. All other

More information

Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING)

Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) Reporting guideline statement for HLA and KIR genotyping data generated via Next Generation Sequencing (NGS) technologies and analysis

More information

Part 1: How to use IGV to visualize variants

Part 1: How to use IGV to visualize variants Using IGV to identify true somatic variants from the false variants http://www.broadinstitute.org/igv A FAQ, sample files and a user guide are available on IGV website If you use IGV in your publication:

More information

Data File Formats File format v1.4 Software v1.9.0

Data File Formats File format v1.4 Software v1.9.0 Data File Formats File format v1.4 Software v1.9.0 Copyright 2010 Complete Genomics Incorporated. All rights reserved. cpal and DNB are trademarks of Complete Genomics, Inc. in the US and certain other

More information

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu Matt Huska Freie Universität Berlin Computational Methods for High-Throughput Omics

More information

SAVANT v1.1.0 documentation CONTENTS

SAVANT v1.1.0 documentation CONTENTS SAVANT v1.1.0 documentation CONTENTS 1 INTRODUCTION --------------2 2 INSTALLATION --------------2 3 RUNNING SAVANT ------------3 4 CONFIGURATION FILE --------4 5 INPUT FILE ----------------5 6 REFERENCE

More information

Introduction to GEMINI

Introduction to GEMINI Introduction to GEMINI Aaron Quinlan University of Utah! quinlanlab.org Please refer to the following Github Gist to find each command for this session. Commands should be copy/pasted from this Gist https://gist.github.com/arq5x/9e1928638397ba45da2e#file-gemini-intro-sh

More information

User Guide. v Released June Advaita Corporation 2016

User Guide. v Released June Advaita Corporation 2016 User Guide v. 0.9 Released June 2016 Copyright Advaita Corporation 2016 Page 2 Table of Contents Table of Contents... 2 Background and Introduction... 4 Variant Calling Pipeline... 4 Annotation Information

More information

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL QIAseq DNA V3 Panel Analysis Plugin USER MANUAL User manual for QIAseq DNA V3 Panel Analysis 1.0.1 Windows, Mac OS X and Linux January 25, 2018 This software is for research purposes only. QIAGEN Aarhus

More information

MiSeq Reporter Amplicon DS Workflow Guide

MiSeq Reporter Amplicon DS Workflow Guide MiSeq Reporter Amplicon DS Workflow Guide For Research Use Only. Not for use in diagnostic procedures. Introduction 3 Amplicon DS Workflow Overview 4 Optional Settings for the Amplicon DS Workflow 7 Analysis

More information

CLC Server. End User USER MANUAL

CLC Server. End User USER MANUAL CLC Server End User USER MANUAL Manual for CLC Server 10.0.1 Windows, macos and Linux March 8, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej 2 Prismet DK-8000 Aarhus C Denmark

More information

ChIP-seq (NGS) Data Formats

ChIP-seq (NGS) Data Formats ChIP-seq (NGS) Data Formats Biological samples Sequence reads SRA/SRF, FASTQ Quality control SAM/BAM/Pileup?? Mapping Assembly... DE Analysis Variant Detection Peak Calling...? Counts, RPKM VCF BED/narrowPeak/

More information

HML Data Dictionary :24:16 CDT

HML Data Dictionary :24:16 CDT HML Data Dictionary 2011-06-22 16:24:16 CDT Table of Contents HML Data Dictionary...1 HML Version 0.3.3...2 HML Version 0.3...8 HML Version 0.2...9 Specialized Data Types...12 NMDP ID...12 Center Code...12

More information

Bioinformatics in next generation sequencing projects

Bioinformatics in next generation sequencing projects Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet March 2011 Once sequenced the problem becomes computational

More information

MiSeq Reporter TruSight Tumor 15 Workflow Guide

MiSeq Reporter TruSight Tumor 15 Workflow Guide MiSeq Reporter TruSight Tumor 15 Workflow Guide For Research Use Only. Not for use in diagnostic procedures. Introduction 3 TruSight Tumor 15 Workflow Overview 4 Reports 8 Analysis Output Files 9 Manifest

More information

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your

More information

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013 RNAseq analysis: SNP calling BTI bioinformatics course, spring 2013 RNAseq overview RNAseq overview Choose technology 454 Illumina SOLiD 3 rd generation (Ion Torrent, PacBio) Library types Single reads

More information

MPG NGS workshop I: Quality assessment of SNP calls

MPG NGS workshop I: Quality assessment of SNP calls MPG NGS workshop I: Quality assessment of SNP calls Kiran V Garimella (kiran@broadinstitute.org) Genome Sequencing and Analysis Medical and Population Genetics February 4, 2010 SNP calling workflow Filesize*

More information

Package seqcat. March 25, 2019

Package seqcat. March 25, 2019 Package seqcat March 25, 2019 Title High Throughput Sequencing Cell Authentication Toolkit Version 1.4.1 The seqcat package uses variant calling data (in the form of VCF files) from high throughput sequencing

More information

An Introduction to VariantTools

An Introduction to VariantTools An Introduction to VariantTools Michael Lawrence, Jeremiah Degenhardt January 25, 2018 Contents 1 Introduction 2 2 Calling single-sample variants 2 2.1 Basic usage..............................................

More information

Package saascnv. May 18, 2016

Package saascnv. May 18, 2016 Version 0.3.4 Date 2016-05-10 Package saascnv May 18, 2016 Title Somatic Copy Number Alteration Analysis Using Sequencing and SNP Array Data Author Zhongyang Zhang [aut, cre], Ke Hao [aut], Nancy R. Zhang

More information

Lecture 12. Short read aligners

Lecture 12. Short read aligners Lecture 12 Short read aligners Ebola reference genome We will align ebola sequencing data against the 1976 Mayinga reference genome. We will hold the reference gnome and all indices: mkdir -p ~/reference/ebola

More information

User s Guide Release 3.3

User s Guide Release 3.3 [1]Oracle Healthcare Translational Research User s Guide Release 3.3 E91297-01 October 2018 Oracle Healthcare Translational Research User's Guide, Release 3.3 E91297-01 Copyright 2012, 2018, Oracle and/or

More information

Step-by-Step Guide to Basic Genetic Analysis

Step-by-Step Guide to Basic Genetic Analysis Step-by-Step Guide to Basic Genetic Analysis Page 1 Introduction This document shows you how to clean up your genetic data, assess its statistical properties and perform simple analyses such as case-control

More information

Handling sam and vcf data, quality control

Handling sam and vcf data, quality control Handling sam and vcf data, quality control We continue with the earlier analyses and get some new data: cd ~/session_3 wget http://wasabiapp.org/vbox/data/session_4/file3.tgz tar xzf file3.tgz wget http://wasabiapp.org/vbox/data/session_4/file4.tgz

More information

v0.3.0 May 18, 2016 SNPsplit operates in two stages:

v0.3.0 May 18, 2016 SNPsplit operates in two stages: May 18, 2016 v0.3.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.

More information

User's Guide to DNASTAR SeqMan NGen For Windows, Macintosh and Linux

User's Guide to DNASTAR SeqMan NGen For Windows, Macintosh and Linux User's Guide to DNASTAR SeqMan NGen 12.0 For Windows, Macintosh and Linux DNASTAR, Inc. 2014 Contents SeqMan NGen Overview...7 Wizard Navigation...8 Non-English Keyboards...8 Before You Begin...9 The

More information

BioBin User Guide Current version: BioBin 2.3

BioBin User Guide Current version: BioBin 2.3 BioBin User Guide Current version: BioBin 2.3 Last modified: April 2017 Ritchie Lab Geisinger Health System URL: http://www.ritchielab.com/software/biobin-download Email: software@ritchielab.psu.edu 1

More information

Variant calling using SAMtools

Variant calling using SAMtools Variant calling using SAMtools Calling variants - a trivial use of an Interactive Session We are going to conduct the variant calling exercises in an interactive idev session just so you can get a feel

More information

A manual for the use of mirvas

A manual for the use of mirvas A manual for the use of mirvas Authors: Sophia Cammaerts, Mojca Strazisar, Jenne Dierckx, Jurgen Del Favero, Peter De Rijk Version: 1.0.2 Date: July 27, 2015 Contact: peter.derijk@gmail.com, mirvas.software@gmail.com

More information

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA-MEM).
