TCGA Variant Call Format (VCF) 1.0 Specification

Domenic Boone
5 years ago

Document Information
Specification for TCGA Variant Call Format (VCF) Version

Contents
About TCGA VCF specification
2 TCGA-specific customizations
3 File format
3.1 HEADER
Generic meta-information
INFO/FORMAT/FILTER meta-information
TCGA-specific meta-information
Column header meta-information
3.2 BODY
Variant records
4 Extensions for TCGA data
4.1 Structural variants
4.2 Complex rearrangements
4.3 RNA-Seq variants
5 Validation rules
5.1 Handling failed checks
5.2 Test files

About TCGA VCF specification

Variant Call Format (VCF) is a format for storing and reporting genomic sequence variations. VCF files are modular: the annotations and genotype information for a variant are kept separate from the call itself. As of May 2011, VCF version 4.1 (described here) is the most recent release. GSCs will generate sequence variation data using high-throughput sequencing technologies, and the resulting variations will be submitted to the DCC as VCF files. TCGA has adopted VCF 4.1 with certain modifications to support supplemental information specific to the project. Subsequent sections describe the format that TCGA VCF files should follow and the validation steps to be implemented at the DCC.

TCGA-specific customizations

The VCF 4.1 specification has been customized to support TCGA-specific variant information. While the majority of the steps pertaining to the basic structure of the file remain the same, checks for supplemental information fields have been introduced. For example, the TCGA VCF specification allows additional fields to represent data associated with complex rearrangements, RNA-Seq variants, and sample-specific metadata. All TCGA-specific additions and modifications in validation steps are prefixed with a <TCGA-VCF> tag for convenient comparison with the 1000Genomes VCF 4.1 validator specification. The following table summarizes TCGA-specific customizations that have been added to the VCF 4.1 specification.
The first column, "Customization type", indicates whether a new validation step has been introduced or an existing step has been modified.

Table 1: TCGA-specific validation steps
Customization type | Description | Validation step # in TCGA-VCF 1.0 spec | Corresponding validation step # in VCF 4.1 spec
New | Validate that file contains ##tcgaversion HEADER line. Its presence indicates that the file is TCGA VCF and the value assigned to the field contains format version number | |
New | Additional mandatory header lines (please refer to Table 2) | #1 | #1
New | Validation of SAMPLE meta-information lines | #15 |
New | Validation of PEDIGREE meta-information lines | #16 |
Modification | Acceptable value set for CHROM has been modified | #18a,b | #16a
Modification | Acceptable value set for ALT has been modified | #19 | #17
New | Validation for INFO sub-field "VT" has been added | #22 |
New | Validation for FORMAT sub-field "SS" has been added | #23 |
New | Validation for INFO/FORMAT sub-field "DP" has been added | #24 |
New | Validation for complex rearrangement records has been added | #25 |
New | Validation for RNA-Seq annotation fields has been added | #26 |

File format

The following example (based on VCF version 4.1) shows the different components of a TCGA VCF file. Any VCF file contains two main sections. The HEADER section contains meta-information for variant records that are reported as individual rows in the BODY of the VCF file. Both sections are described below.

Case-sensitivity: Please note that all fields and their associated validation rules are case-sensitive (as given in the specification) unless noted otherwise.

Figure 1: Components of a sample TCGA VCF file

HEADER

The HEADER contains meta-information lines that provide supplemental information about the variants contained in the BODY of the file. HEADER lines can be formatted in either of the following two ways:

##key=value
##fileformat=vcfv4.1
##filedate=

or

##FIELDTYPE=<key1=value1,key2=value2,...>
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

Meta-information can apply either to all variant records in the file (e.g., date of creation of the file) or to individual variants (e.g., a flag indicating whether a given variant exists in dbSNP).

Generic meta-information

Format: ##key=value OR ##FIELDTYPE=<key1=value1,key2=value2,...>

The following table lists some of the reserved field names. Files can be customized to contain additional meta-information fields as long as they do not conflict with reserved field names. The first field in Table 2 (fileformat) is mandatory and lists the VCF version number of the file.

Table 2: Examples of generic meta-information fields
Columns: Field, Case-sensitive, Description, Sample values, Required (fields in red are TCGA-specific requirements)

Field: fileformat
Description: Lists the VCF version number the file is based on; must be the first line in the file.
Sample values: ##fileformat=vcfv4.1

Field: filedate
Description: Date the file was created; should be in yyyymmdd format.
Sample values: ##filedate=

Field: tcgaversion
Description: Indicates that the file follows the TCGA-VCF specification. The format version number is assigned to the field.
Sample values: ##tcgaversion=1.0

Field: reference
Description: Reference build used for variant calling and against which variant coordinates are shown.
Sample values: ##reference=1000genomespilot-ncbi36 OR ##reference=<id=hg18, Source= file://seq/references/1000genomespilot-ncbi36.fasta

Field: assembly
Description: External assembly file.
Sample values: ##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta
Required: if a contig from an assembly file is being referred to in the VCF file, especially for breakends

Field: center
Description: Name of the center where the VCF file is generated. A comma-separated list can be provided if files from multiple centers are merged.
Sample values: ##center="broad" OR ##center="broad,ucsc,bcm"

Field: phasing
Description: Indicates whether genotype calls are partially phased (phasing=partial) or unphased (phasing=none).
Sample values: ##phasing=none

Field: geneanno
Description: URL of the gene annotation source, e.g., Generic Annotation File (GAF).
Sample values: ##geneanno= /GAF_bundle_Feb2011/outputs/TCGA.hg18.Feb2011.gaf
Required: if annotation tags like GENE, SID and RGN are used

Field: vcfprocesslog
Description: Lists the algorithm, version and settings used to generate variant calls in a VCF file. If multiple VCF files are processed to produce a single merged file, the field records attributes for the individual VCF files and the programs used to merge the files, along with the associated version, parameters and contact information of the person who produced the merged file. Note: If the VCF file does not represent a set of merged files, the MergeSoftware, MergeParam, MergeVer and MergeContact tags are not applicable and can be omitted.
Sample values:
##vcfprocesslog=<inputvcf=<file1.vcf>, InputVCFSource=<varCaller1>, InputVCFVer=<1.0>, InputVCFParam=<a1,c2>, InputVCFgeneAnno=<anno1.gaf>>
OR
##vcfprocesslog=<inputvcf=<file1.vcf,file2.vcf,file3.vcf>, InputVCFSource=<varCaller1,varCaller2,varCaller3>, InputVCFVer=<1.0,2.1,2.0>, InputVCFParam=<a1,c2;a1,b1;a1,b1>, InputVCFgeneAnno=<anno1.gaf,anno2.gaf,anno3.gaf>, MergeSoftware=<sw1,sw2>, MergeParam=<a1,a2;b1,b2>, MergeVer=<2.1,3.0>, MergeContact=<johndoe@xyz.edu>>

Field: INDIVIDUAL
Description: Specifies the individual for which data is presented in the file.
Sample values: ##INDIVIDUAL=TCGA

INFO/FORMAT/FILTER meta-information

Format: ##FIELDTYPE=<key1=value1,key2=value2,...>

INFO, FORMAT and FILTER (case-sensitive values) are optional fields that have to be declared in the HEADER if they are referred to in the BODY of the file. The keys that can be used to define them are described in Table 3. The three fields do not all use the same set of keys. Please refer to the individual field definitions for further details.
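The two HEADER line shapes described above (##key=value and ##FIELDTYPE=<key1=value1,...>) can be sketched as a small parser. This is an illustrative sketch only, not part of the specification: the function names `parse_meta_line` and `split_top_level` are hypothetical, and the handling of nested angle brackets and quoted commas is an assumption based on the SAMPLE and vcfprocesslog examples.

```python
def split_top_level(text):
    """Split on commas that are outside double quotes and nested <...>."""
    parts, buf, depth, quoted = [], [], 0, False
    for ch in text:
        if ch == '"':
            quoted = not quoted
        elif not quoted and ch == "<":
            depth += 1
        elif not quoted and ch == ">":
            depth -= 1
        if ch == "," and depth == 0 and not quoted:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    if buf:
        parts.append("".join(buf))
    return parts


def parse_meta_line(line):
    """Return (field, value) for ##key=value, or (field, dict) for
    ##FIELDTYPE=<key1=value1,...>. Values are kept as raw strings."""
    if not line.startswith("##"):
        raise ValueError("meta-information lines must start with '##'")
    field, sep, rest = line[2:].partition("=")
    if not sep:
        raise ValueError("expected '##key=value'")
    if not (rest.startswith("<") and rest.endswith(">")):
        return field, rest  # simple ##key=value form
    pairs = {}
    for item in split_top_level(rest[1:-1]):
        key, _, value = item.partition("=")
        pairs[key.strip()] = value.strip()
    return field, pairs


print(parse_meta_line('##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">'))
```

Values are returned unvalidated (a Description keeps its surrounding quotes), so the per-key checks in the validation rules can be applied as a separate step.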
Table 3: Description of keys used in INFO/FORMAT/FILTER meta-information declarations
Columns: Key, Case-sensitive, Description, Data type (Possible values), Additional notes

Key: ID
Description: Name of the field; also used in the BODY of the file to assign values for individual variant records.
Data type (possible values): String, no whitespace, no comma

Key: Number
Description: Specifies the number of values that can be associated with the corresponding field.
Data type (possible values): Set (Integer >= 0, "A", "G", ".")
Additional notes: Any integer >= 0 indicating the number of values; "A" if the field has one value per alternate allele; "G" if the field has one value per genotype; "." if the number of values varies, is unknown, or is unbounded.

Key: Type
Description: Indicates the data type of the value associated with the field.
Data type (possible values): Set (Integer, Float, Flag, Character, String)
Additional notes: "Flag" type indicates that the field does not contain a value entry, and hence the Number should be 0 in this case. FORMAT fields cannot have a "Flag" Type assigned to them.

Key: Description
Description: Provides a brief description of the field.
Data type (possible values): String, surrounded by double quotes; cannot itself contain a double quote; cannot contain trailing whitespace at the end of the string before the closing quote

INFO lines

Format: ##INFO=<key1=value1,key2=value2,...>
Required keys: ID, Type, Number, Description

INFO fields are optional and contain additional annotations for a variant. Certain INFO fields have already been created and exist as reserved fields in the current VCF standard. Custom INFO fields can be added based on study requirements as long as they do not use the reserved field names. If an INFO field is declared in the header, it needs to be described further using the following format:

##INFO=<ID=ID,Number=number,Type=type,Description="description">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

FORMAT lines

Format: ##FORMAT=<key1=value1,key2=value2,...>
Required keys: ID, Type, Number, Description

FORMAT declaration lines are optional and are used when annotations need to be added for individual genotypes associated with each sample in the file. FORMAT sub-fields are declared exactly like INFO sub-fields, with the exception that a FORMAT sub-field cannot be assigned a "Flag" Type.

##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

FILTER lines

Format: ##FILTER=<key1=value1,key2=value2,...>
Required keys: ID, Description

FILTER fields are defined to list the filtering criteria used for generating variant calls. Custom filters can be applied as long as a definition is provided in the HEADER. FILTERs that have been applied to the data should be described as follows. Please note that FILTER declarations do not include Type or Number keys.
##FILTER=<ID=ID,Description="description">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">

TCGA-specific meta-information

PEDIGREE lines

Format: ##PEDIGREE=<key1=value1,key2=value2,...>
Required keys: Name_0,..,Name_N where N >= 1

PEDIGREE lines are used to specify derivation relationships between different genomes. Name_0 is associated with the derived genome, and Name_1 through Name_N represent the genomes from which it is derived. In the case of tumor clonal populations, one population is clonally derived from another. In the example below, PRIMARY-TUMOR-GENOME is derived from GERMLINE-GENOME.

##PEDIGREE=<Name_0=<G0-ID>,Name_1=<G1-ID>,...,Name_N=<GN-ID>> where N >= 1
##PEDIGREE=<Name_0=PRIMARY-TUMOR-GENOME,Name_1=GERMLINE-GENOME>

SAMPLE lines

Format: ##SAMPLE=<key1=value1,key2=value2,...>
Required keys: ID, Individual, File, Platform, Source, Accession

SAMPLE lines are used to include additional metadata about each sample for which data is represented in the VCF file. All samples are listed in the column header line following the FORMAT column (Figure 1). Each of these samples should have its own HEADER declaration, where the sample identifier in the column header is the same as the value assigned to the "ID" key in the corresponding declaration. The declaration lists information about the sample (source, platform, source file, etc.) and can also be used to indicate whether the sample is a mixture of different kinds of genomes. In the example below, the "Genomes", "Mixture" and "Genome_Description" tags represent, respectively, a comma-separated list of the different genomes that a sample contains, the proportion of each genome in the sample, and a brief description of each genome.

##SAMPLE=<ID=id,SampleName=sampleName,Individual=individual,Description="description",File=bamfile,Platfo contamination","tumor genome">>

The "Description" field for genome mixture has been renamed to "Genome_Description" to distinguish it from the sample description. Values for tags related to genome mixture (Genomes, Mixture, Genome_Description) are within angle brackets.

Column header meta-information

Format: Tab-delimited line starting with "#" and containing headers for all columns in the BODY as shown below. This is a mandatory header line where the first 8 fields are fixed and have to be defined in the column header. "FORMAT" onwards are optional and are included to encapsulate per-sample/genome genotype data.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT <SAMPLE1 or GENOME1> <SAMPLE2 or GENOME2>...

BODY

Variant records

Data lines are tab-delimited and list information about individual variants and the associated genotypes across samples. The first 8 fields (Figure 1) are required to be listed in the VCF column header line. Some of these fields require non-null values (see Table 4) for each record. For the remaining fixed fields, even if the field does not have an associated value, it still needs to be specified with a missing value identifier ("." in VCF 4.1). Subsequent fields are optional.

Table 4: Description of fields in the BODY of a VCF file
Columns: Index, Field, Case-sensitive, Description, Data type (Possible values), Sample values, Required*, Additional notes

Index: 1
Field: CHROM
Description: Chromosome: an identifier from the reference genome or the assembly file defined in the HEADER.
Data type (possible values): Alphanumeric string ([1-22], X, Y, MT, <ID>)
Sample values: 20; <ctg1>
Additional notes: Chromosome name should not contain the "chr" prefix, e.g., "chr10" is an invalid entry.

Index: 2
Field: POS
Description: Position: the reference position, with the 1st base having position 1.
Data type (possible values): Non-negative integer

Index: 3
Field: ID
Description: Identifier: semicolon-separated list of unique identifiers, if available.
Data type (possible values): String, no whitespace or semicolons
Sample values: rs

Index: 4
Field: REF
Description: Reference allele(s): reference allele at the position.
Data type (possible values): String ([ACGTN]+)
Sample values: GTCT
Additional notes: The value in the POS field refers to the position of the first base in the REF string.

Index: 5
Field: ALT
Description: Alternate allele(s): comma-separated list of alternate non-reference alleles called on at least one of the samples. An angle-bracketed ID string (<ID>) can also be used to represent alternate alleles symbolically.
Data type (possible values): String; no whitespace, commas, or angle brackets in the ID string ([ACGTN]+, <ID>, .)
Sample values: G,GTCT; .; <INS:ME:ALU>
Additional notes: If ALT == <ID>, the ID needs to be defined in the header as ##ALT=<ID=Id,Description="description">

Index: 6
Field: QUAL
Description: Quality score: Phred-scaled quality score for the assertion made in ALT.
Data type (possible values): Integer >= 0
Sample values: 50
Additional notes: Scores should be non-negative integers or missing values.

Index: 7
Field: FILTER
Description: Filtering results: PASS if this position has passed all filters. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for the filters that fail.
Data type (possible values): String, no whitespace or semicolon
Sample values: PASS; q10;s50
Additional notes: "0" is reserved and cannot be used as a filter string.

Index: 8
Field: INFO
Description: Additional information: INFO fields are encoded as a semicolon-separated series of keys (same as ID in an INFO declaration) with optional values in the format <key=value>.
Data type (possible values): String, no whitespace, semicolons, or equal signs
Sample values: NS=3;DP=14;

Index: 9
Field: FORMAT
Description: Genotype sub-fields: if genotype data is present in the file, the fixed fields are followed by a FORMAT column. The field contains a colon-separated list of all pre-defined FORMAT sub-fields (same as ID in a FORMAT declaration) that are applicable to all samples that follow.
Data type (possible values): String, no whitespace; sub-fields cannot contain a colon
Sample values: GT:GQ:DP:HQ
Additional notes: "GT" must be the first sub-field if it is present in the FORMAT field.

Index: 10
Field: <SAMPLE>
Case-sensitive: Case should be the same as in the "ID" tag of the ##SAMPLE declaration in the header.
Description: Per-sample genotype information: an arbitrary number of sample IDs can be added to the column header line, and a variant record in the BODY can contain genotype information corresponding to the FORMAT column for each sample. Contains a colon-separated list of values assigned to each of the sub-fields in the FORMAT column.
Data type (possible values): String, no whitespace; sub-fields cannot contain a colon
Sample values: 0|0:48:1:51,51
Additional notes: Values are assigned to FORMAT sub-fields in the SAME order as specified in the "FORMAT" column. All samples in any given row for a variant record MUST contain values for all sub-fields as defined in the "FORMAT" column. If any of the fields does not have an associated value, then the missing value identifier (".") should be used for that field. However, "." cannot be used as a value for any of the IDs in the FORMAT field (e.g., GT:.:DP would lead to an error).

* A "Required" field cannot contain the missing value identifier for any record listed in the data lines.

Extensions for TCGA data

TCGA data includes but is not limited to SNPs and small indels. A variant representation format for cancer data should be able to support more complex variation types such as structural variants, complex rearrangements and RNA-Seq variants. The following sub-sections present an overview of the extensions that have been added to clearly describe such variations in a VCF file.

Structural variants

A structural variant (SV) can be defined as a region of DNA that includes a variation in the structure of the chromosome. Such variations could be due to inversions and balanced translocations or to genomic imbalances (insertions and deletions), also referred to as copy number variants (CNVs). Certain features have been added to the format in order to clearly describe structural variants in a VCF file. A detailed description of the extensions is available here.
Complex rearrangements

Chromosomal rearrangements are caused by breakage of DNA double helices at two different locations. The broken ends in turn rejoin to produce a new chromosomal arrangement. Complex rearrangements involving more than two breaks are frequently observed in cancer genomes. Certain modifications need to be made to the VCF standard to adequately represent such variations in a VCF file. A detailed specification of the proposed extensions to describe rearrangements in a VCF file is available here. Figure 2 illustrates some of the concepts relevant to VCF records for complex rearrangements.

Figure 2: Adjacencies and breakends in a chromosomal rearrangement (adapted from VCF 4.1 specification)

A VCF file has one line for each of the two breakends in an adjacency. Table 5 provides a list of sub-fields that have been added to describe breakends. An INFO sub-field (SVTYPE=BND) is used to indicate a breakend record. Sub-fields MATEID and PARID are used to represent the variant record IDs of the corresponding mates and partners respectively.

Table 5: Fields added for breakends
Columns: Field:Sub-field, Description, Declaration in HEADER (Sample values in BODY), Required

Field:Sub-field: INFO:SVTYPE
Description: Type of structural variant; SVTYPE is set to "BND" for breakend records.
Declaration in HEADER: ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
Sample value in BODY: SVTYPE=BND
Required: SVTYPE=BND for breakend records

Field:Sub-field: INFO:MATEID
Description: ID of the corresponding mate of the breakend record.
Declaration in HEADER: ##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakend">
Sample value in BODY: MATEID=bnd_U

Field:Sub-field: INFO:PARID
Description: ID of the corresponding partner of the breakend record.
Declaration in HEADER: ##INFO=<ID=PARID,Number=.,Type=String,Description="ID of partner breakend">
Sample value in BODY: PARID=bnd_V

Field:Sub-field: INFO:EVENT
Description: ID of the event associated with the breakend.
Declaration in HEADER: ##INFO=<ID=EVENT,Number=.,Type=String,Description="ID of breakend event">
Sample value in BODY: EVENT=RR0

The specification for the ALT field deviates from the standard format for breakend records. The ALT field for a breakend record can be represented in four possible ways based on the type of replacement.

REF | ALT | Description
s | t[p[ | piece extending to the right of p is joined after t
s | t]p] | reverse comp piece extending left of p is joined after t
s | ]p]t | piece extending to the left of p is joined before t
s | [p[t | reverse comp piece extending right of p is joined before t

Legend:
s: sequence of REF bases beginning at position POS
t: sequence of bases that replaces "s"
p: position of the breakend mate indicating the first mapped base that joins at the adjacency; represented as a string of the form "chr:pos"
[]: square brackets indicate the direction in which the joined sequence continues, starting from p
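The four breakend ALT forms in the table above can be told apart mechanically from the bracket type and from whether t comes before or after the mate position. The following is a minimal sketch under stated assumptions: the function name `parse_breakend_alt` and the returned dictionary keys are hypothetical, and the alphabet for t ([ACGTN]+ or ".") is borrowed from the validation rules for SV_ALT rather than from the table itself.

```python
import re

# Assumption: t uses the alphabet given for "seq" in the validation rules.
_SEQ = r"(?:[ACGTN]+|\.)"
_BREAKEND = re.compile(
    rf"(?:(?P<t_first>{_SEQ})(?P<b1>[\[\]])(?P<p1>[^\[\]]+)(?P=b1)"
    rf"|(?P<b2>[\[\]])(?P<p2>[^\[\]]+)(?P=b2)(?P<t_last>{_SEQ}))"
)


def parse_breakend_alt(alt):
    """Decode one breakend ALT string (t[p[, t]p], ]p]t or [p[t)."""
    m = _BREAKEND.fullmatch(alt)
    if m is None:
        raise ValueError(f"not a breakend ALT: {alt!r}")
    if m.group("t_first") is not None:
        t, bracket, mate, joined = m.group("t_first"), m.group("b1"), m.group("p1"), "after"
    else:
        t, bracket, mate, joined = m.group("t_last"), m.group("b2"), m.group("p2"), "before"
    chrom, sep, pos = mate.rpartition(":")  # p has the form "chr:pos"
    if not sep or not chrom or re.fullmatch(r"[0-9]+", pos) is None:
        raise ValueError(f"mate position must look like chr:pos, got {mate!r}")
    return {
        "t": t,
        "chrom": chrom,
        "pos": int(pos),
        "joined": joined,  # mate piece is joined after or before t
        "extends": "right" if bracket == "[" else "left",
        # Per the table: t]p] and [p[t are the reverse-complemented forms.
        "reverse_complement": (joined == "after") == (bracket == "]"),
    }


print(parse_breakend_alt("G[17:198982["))
```

Mismatched brackets such as `G[17:198982]` are rejected because the closing bracket must repeat the opening one.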

RNA-Seq variants

VCF specifications have been extended to address expressed variants obtained from RNA-Seq. Features added for structural variants from genome/exome sequencing are applicable to RNA-Seq structural variants. However, RNA-Seq breakends are represented by setting SVTYPE=FND instead of BND (Table 6) since they can be different from those observed in DNA-Seq.

Table 6: Fields added for RNA-Seq variants
Columns: Field:Sub-field, Description, Declaration in HEADER (Sample values in BODY), Required

Field:Sub-field: INFO:SVTYPE
Description: Type of structural variant; SVTYPE is set to "FND" for breakends associated with RNA-Seq.
Declaration in HEADER: ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
Sample value in BODY: SVTYPE=FND
Required: required for RNA-Seq breakend records; SVTYPE=FND

VCF files for RNA-Seq variants may include gene-related annotations. However, this is not a standard feature of VCF files, as eventually all VCF variants will be annotated using information in the Generic Annotation File (GAF). Additional INFO and FORMAT sub-fields have been included to describe the characteristics of expressed nucleotide variants (Table 6a).

Table 6a: Annotation fields added for RNA-Seq variants
Columns: Field:Sub-field, Description, Declaration in HEADER (Sample values in BODY), Required

Field:Sub-field: INFO:SID
Description: Unique identifiers from the gene annotation source as specified in ##geneanno; "unknown" should be used if the identifier is not known; a comma-separated list of IDs can be used if the variant overlaps with multiple features.
Declaration in HEADER: ##INFO=<ID=SID,Number=.,Type=String,Description="Unique identifier from gene annotation source or unknown">
Sample value in BODY: SID=13,198

Field:Sub-field: INFO:GENE
Description: HUGO gene symbol; "unknown" should be used when the gene symbol is unknown; a comma-separated list of genes can be used if the variant overlaps with multiple transcripts/genes.
Declaration in HEADER: ##INFO=<ID=GENE,Number=.,Type=String,Description="HUGO gene symbol">
Sample value in BODY: GENE=ERBB2,ERBB2

Field:Sub-field: INFO:RGN
Description: Region where a nucleotide variant occurs in relation to a gene.
Declaration in HEADER: ##INFO=<ID=RGN,Number=.,Type=String,Description="Region where nucleotide variant occurs in relation to a gene">
Sample value in BODY: RGN=exon,3_utr

Field:Sub-field: INFO:RE
Description: Flag to indicate if the position is known to have RNA-edits occur.
Declaration in HEADER: ##INFO=<ID=RE,Number=0,Type=Flag,Description="Position known to have RNA-edits to occur">
Sample value in BODY: RE

Field:Sub-field: FORMAT:TE
Description: Translational effect of a nucleotide variant in a codon.
Declaration in HEADER: ##FORMAT=<ID=TE,Number=.,Type=String,Description="Translational effect of the variant in a codon">
Sample value in BODY: MIS,NA

Validation rules

At a minimum, every file needs to go through the checks listed below. The following example VCF file shows certain violations cited in the listed validation steps. Please note that the line numbers in the file segment below are added for illustration purposes alone and are not expected to be found in an actual VCF file.

Line1 ##fileformat=vcfv4.1
Line2 ##filedate=
Line3 ##source=myimputationprogramv3.1
Line4 ##reference=file:///seq/references/1000genomespilot-ncbi36.fasta
Line5 ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
Line6 ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
Line7 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
Line8 ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
Line9 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
Line10 ##FORMAT=<ID=PL,Number=3,Type=Integer,Description=" Normalized Phred-scaled likelihoods for AA, AB, BB genotypes ">
Line11 ##FILTER=<ID=q10,Description="Quality below 10">
Line12 ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
Line13 FILTER=<ID=c10,Description="Shallow coverage below 10x">
Line14 ##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
Line15 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCGA TCGA
Line16 var1 G A 29 q10 NS=2;DP=14 GT:GQ:DP 0|0:48 0|1:48:3
Line17 var2 G A 35 q10;s50 NS=2.5 GQ:GT 48:0|0 51:0|1
Line18 var3 C T 30 q10;s10 NS=2 GT:GQ:DP 0/2:48:3 0/1:51:4
Line19 rs123 C <DEL:ME:ALU> 12 PASS NS=3;DB GT:GQ 0/1:50 1/1:40
Line20 A <DUP> 20 PASS NS=3 GT:GQ:PL 0/1:49:42,3 1/1:38:96,47/70
Line21 rs456 T C 15 PASS NS=3/DB GT 0/1 1/1

Important: A file will be validated as a TCGA VCF file only if it contains the ##tcgaversion HEADER line (e.g., ##tcgaversion=1.0). The current acceptable version is

Mandatory header lines should be present.

All meta-information header lines should be prefixed with "##". The column header line should be prefixed with "#".

A VCF file can contain only a single column header line, which must contain all required field names.

Any line lacking the "##" or "#" prefix will be assumed to be a BODY data line and will have to follow the specified format. For example, Line13 leads to a violation as it lacks "##" or "#" but is not a tab-delimited row containing variant information.
HEADER lines cannot be present within the BODY of a file and vice-versa.

INFO, FORMAT and FILTER declarations should follow the format below, where all keys are required but the order of keys is irrelevant.

##INFO=<ID=id,Number=number,Type=type,Description="description">
##FORMAT=<ID=id,Number=number,Type=type,Description="description">
##FILTER=<ID=id,Description="description">

7. Values assigned to ID, Number, Type and Description in INFO, FORMAT or FILTER declarations should follow the rules listed below. A detailed description of the declaration format is provided here.
   a. ID, Number, Type !~ /(\s|,|=|;)/
   b. Number is in {Integer >= 0, "A", "G", "."}
   c. Type is in {Integer, String, Float, Flag, Character}
   d. Description should be within double quotes and cannot itself contain a double quote.
   e. Description string cannot contain leading or trailing whitespace after the opening or before the closing quotation mark; Line10 shows a violation as the Description string contains leading and trailing whitespace.
   f. For FORMAT declarations, Type != "Flag"

8. Any INFO, FORMAT or FILTER sub-fields used in the BODY are required to be defined in the HEADER. For example, var1 (Line16) shows a violation as read depth "DP" is assigned a value (DP=14) without being defined as an INFO sub-field in the HEADER.

9. Validation of INFO sub-fields:
   a. An INFO sub-field should be included for a variant record in the BODY as <key=value> (e.g., NS=2), where key is the "ID" value of the sub-field in the HEADER declaration. Exception: An INFO field of "Flag" Type will not be assigned a value in the BODY. The presence of a flag in the INFO column merely indicates that the variant record satisfies a condition associated with the flag. For example, Line19 has a "DB" flag without a value entry in the INFO column. "DB" in the INFO column indicates that the variant exists in dbSNP.
   b. Multiple INFO sub-fields can be associated with a single variant record using ";" as a separator (e.g., Line16). Line21 has a violation as "/" is used as a separator in the INFO column.

10. Validation of FORMAT sub-fields:
   a. The FORMAT column for a variant record contains a colon-separated list of all pre-defined FORMAT sub-fields (identified by the "ID" value in the HEADER declaration) that are applicable to all samples that follow. A ":" is the only valid separator for sub-fields.

   b. The number of colon-separated sub-fields in the FORMAT column should equal the number of colon-separated values assigned to each sample. For example, var1 (Line16) violates this rule for the sample TCGA as there are 3 sub-fields in the FORMAT column but only 2 values in the sample column.
   c. If the "GT" sub-field is defined in the FORMAT field for a variant, it must be the first sub-field in the string. For example, var2 (Line17) violates this rule as GT is not the first sub-field even though it is present in the FORMAT field.
      i. GT is not a required sub-field and can be omitted for a variant row if none of the samples have genotype calls available.
      ii. GT represents the genotype, encoded as allele values separated by either / (genotype unphased) or | (genotype phased). The allele values are 0 for the reference allele (in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. Examples: 0/1, 1|0, or 1/2, etc.
      iii. GT is assigned only one allele value for haploid calls (e.g., on the Y chromosome). Therefore, if CHROM == "Y" then GT should have only one allele value assigned to it (e.g., "1", "0", ".", etc.) instead of two alleles (e.g., "1/1", "0|0"). If CHROM == "MT" then there is no constraint on the number of alleles as long as the allele values are within the alleles listed in REF and/or ALT (e.g., 0/1, 0/1/2 and 1 are all valid values for MT if REF and ALT have one and two allele values respectively).
      iv. If GT is present in the "FORMAT" column, then all samples should have values assigned to the field for that variant. If an allele cannot be called for a sample at a given locus, "." will be specified for each missing allele in the GT field (for example, "./." for a diploid genotype and "." for a haploid genotype).
      v. Validation should include ensuring that the allele number in GT is within the range of alleles specified in ALT and REF.
For example, var3 (Line18) violates this rule as it lists GT as "0/2" for sample TCGA but ALT contains only one allele, so the only acceptable allele numbers are 0 (REF) and 1 (ALT).

11. If an INFO or FORMAT sub-field is declared in the header AND is assigned a value for a variant record in the body, the data type should be consistent with the expected type defined in the Type key of the corresponding declaration. For example, var2 (Line17) violates this rule as the definition for the "NS" INFO sub-field states that the data type is integer whereas the variant record contains a float value (2.5) assigned to the sub-field. Exception: The rule does not apply if the Type of a field is not defined or is incorrectly defined (e.g., field not declared in HEADER, Type not included in declaration, incorrect value for Type). It also does not apply to any missing values (denoted with ".") in the record as they do not have an associated data type.

12. Multiple comma-separated values (corresponding to the value assigned to the Number key in the declaration) can be specified for an INFO or FORMAT sub-field for a variant record. No other character can be used as a separator. Line20 shows a violation as a "/" is used as the separator between the 2nd and 3rd values for the "PL" FORMAT sub-field in the second sample column.

13. If the Number tag is assigned a known bounded value (an integer, "A", "G") for an INFO/FORMAT sub-field, it should be consistent with the number of values specified for any variant record in the BODY of the file. For example, Line20 shows a violation as "PL" is associated with 3 integer values (Line10) but the variant record has only 2 comma-separated integer values (42,3) for TCGA

14. Validation of FILTER sub-fields:
   a. Valid values for the FILTER column are "PASS" or a code for the filter that the variant call fails (e.g., "q10" in Line16). The code must correspond to the "ID" value of the corresponding FILTER declaration.
   b. If a call fails multiple filters, the FILTER column should contain a semicolon-separated list of all failed filter codes (e.g., "q10;s50" in Line17). A ";" is the only valid separator.
   c. All codes listed in the FILTER column must have a well-formed declaration in the HEADER. Line18 shows a violation as "s10" does not have an associated definition in the HEADER.

15. <TCGA-VCF> Validation of SAMPLE meta-information lines:
   a. Each sample ID in the column header (immediately after the FORMAT column) must have an associated HEADER declaration where the value assigned to the "ID" tag in the declaration is the same as the sample ID used in the column name.
   b. The declaration must contain all required fields.
   c. Genome mixture tags (Genomes, Mixture, Genome_Description) are enclosed within angle brackets (<>) and can have multiple comma-separated values.
   d. If more than one of the genome mixture tags (Genomes, Mixture, Genome_Description) is defined in a SAMPLE meta-information line, then the number of comma-separated values should be the same for all defined tags. For example, "Genomes=<G1,G2>,Mixture=<0.1,0.8,0.1>" would lead to a violation as Mixture has 3 values while Genomes has only 2 values.
   e. Individual values in "Genomes" are strings without whitespace, commas or angle brackets.
   f. Individual values in "Mixture" represent the proportion (floating point number >= 0 and <= 1) of each genome in the sample, and all comma-separated values should add up to 1.
   g. Individual values in "Genome_Description" are strings surrounded by double quotes, where the string itself cannot contain a double quote.

16. <TCGA-VCF> Validation of PEDIGREE meta-information lines:
   a. The declaration line should follow the format ##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID> where:
      i. N >= 1
      ii. Name_0 through Name_N are strings that cannot contain whitespace, commas, or angle brackets.
      iii. G0-ID through GN-ID are strings that cannot contain whitespace, commas, or angle brackets. Each of these should be a header for the genotype columns immediately after the FORMAT column and should be defined using the "ID" tag in the corresponding ##SAMPLE meta-information line.

17. Validation of custom meta-information fields:
   a. If a user-created custom meta-information declaration is encountered and the corresponding key/value structure and content have not been defined in this specification, the line should be validated to ensure it follows one of the following two formats:

       ##key=value
           e.g., ##<INDIVIDUAL=TCGA >
       OR
       ##FIELDTYPE=<key1=value1,key2=value2,...>
           e.g., ##contig=<id=20,length= ,assembly=b36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="h sapiens",taxonomy=x>
       where:
       i. key !~ /(\s|,|=|;)/
       ii. value !~ /(\s|,|=|;)/ UNLESS value is within double quotes, in which case it cannot itself contain a double quote or leading/trailing whitespace, OR value is within angle brackets.
18. CHROM, POS, and REF are required fields and cannot contain missing value identifiers. Please refer to Table 4 for acceptable values.
    a. <TCGA-VCF> CHROM is in {[1-22], X, Y, MT, <chr_id>} where chr_id cannot contain whitespace or <>
    b. If CHROM == <chr_id> then the VCF file MUST have a declaration for the assembly file in the HEADER in the following format:
       ##assembly=url
           e.g., ##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fas
    c. POS is a non-negative integer
    d. REF =~ /[ACGTN]+/
19. <TCGA-VCF> ALT is in {[ACGTN]+, ".", <ID>, SV_ALT}
    a. String SV_ALT can be in one of the following four formats and can be used in the ALT field ONLY when the corresponding INFO field has the key-value pair "SVTYPE=BND" or "SVTYPE=FND".

       Format          Example
       seq[chr:pos[    G[17:198982[
       seq]chr:pos]    GC]1:238909]
       ]chr:pos]seq    ]<ctg1>:235788]gcna
       [chr:pos[seq    [1: [ACT

       where:
       i. seq is in {[ACGTN]+, "."}
       ii. chr is in {<chr_id>, [1-22], X, Y, MT} where chr_id is a string
       iii. pos is a non-negative integer
    b. Similar to 18b, if chr == <chr_id> (where chr_id is a string) then the VCF file must have an ##assembly declaration in the HEADER.
    c. If ALT is assigned a value in <ID> format (e.g., rs123 in Line19), <ID> should be defined in the HEADER as ##ALT=<ID=ID,Description="Description"> (Line14), where ID cannot contain white-space or angle brackets. Line20 shows a violation of this rule, as ALT == <DUP> but there is no corresponding ALT declaration in the HEADER with <ID=DUP>.
    d. ALT can contain multiple comma-separated values. No other character can be used as a separator.
20. No two records are allowed to have the same ID value. Two records can, however, have the same CHROM and POS values.
    Exception: Multiple records in a file are allowed to have the same missing value identifier (".") as ID.
21. The QUAL field can only contain non-negative integers or "." (missing value).
22. <TCGA-VCF> If the INFO sub-field "VT" is declared and used in the BODY, its value can only be in {SNP, INS, DEL}
23. <TCGA-VCF> If the FORMAT sub-field "SS" is declared and used in the BODY, its value can be 0, 1, 2, 3, or 4, depending on whether, relative to normal, the variant is wildtype, germline, somatic, LOH or unknown, respectively.
24. <TCGA-VCF> The "DP" sub-field for read depth can be defined in the INFO field (combined depth across all samples) or the FORMAT field (depth in a specific sample). If both INFO and FORMAT have values for the sub-field, then the sum of DP values across all FORMAT sample columns should be equal to the DP value in the INFO field.
25. <TCGA-VCF> Validation of complex rearrangement records:
    a. If the INFO field includes the key-value pair "SVTYPE=BND" or "SVTYPE=FND" and has values for "MATEID" and/or "PARID", then the value (or each of multiple comma-separated values) assigned to MATEID or PARID should exist in the file as the "ID" field of another variant record.
26. <TCGA-VCF> Validation of RNA-Seq annotation fields:
    a. If the INFO field includes "SID", "GENE" or "RGN" keys with associated values, then the file MUST contain a declaration for ##geneanno in the HEADER.
    b. The number of comma-separated values in the optional INFO sub-fields "SID", "GENE" and "RGN" and the FORMAT sub-field "TE" must be the same if more than one of these sub-fields is defined for a record.
    c. INFO sub-field "RGN" is in {5_utr, 3_utr, exon, intron, ncds, sp}.
    d. FORMAT sub-field "TE" is in {SIL, MIS, NSNS, NSTP, FSH, NA}

    e. If "RGN" and "TE" have the same number of comma-separated values, then "RGN" must be "exon" for "TE" to have any value other than "NA". For example, if "RGN=exon,intron,intron" then having "MIS,SIL,NA" for TE would lead to a violation, as the 2nd value for RGN is "intron" but the corresponding TE value is "SIL" instead of "NA".
27. <TCGA-VCF> Validation of vcfprocesslog tags:
    ##vcfprocesslog=<inputvcf=<file1.vcf>,inputvcfsource=<varcaller1>,inputvcfver=<1.0>,inputvcfparam=
    a. Individual values for each tag are enclosed within angle brackets (<>) instead of double quotes.
    b. If a field contains multiple values, they are separated by commas.
       Exception: The separator for multiple values in InputVCFParam and MergeParam is ";" instead of ",". Individual values within these tags can contain comma-separated parameters (e.g., <a1,c2;a1,b1;a1,b1 in the example given above).
    c. If the InputVCF tag has multiple comma-separated values assigned to it (please refer to the second example above), then InputVCFSource, InputVCFVer, InputVCFParam, and InputVCFgeneAnno must contain the same number of values. If a value is not known, it should be substituted with the missing value identifier (".").
    d. If InputVCF contains only a single value, then all tags =~ /Merge.*/ are optional and can either be omitted or contain the missing value identifier ("."). The reason is that attributes related to merging VCF files are applicable only if multiple input VCF files are being merged.
    e. If MergeSoftware contains multiple comma-separated values, MergeParam and MergeVer should contain the same number of values. There is no such constraint for MergeContact.

Handling failed checks

A VCF file is required to pass ALL the checks listed above, and any violation leads to a "Failed" validation. Even if a failure is encountered, the file still needs to go through all other checks for validation to be complete.
Exceptions to this requirement are cases where the execution of one validation check depends on the success of a prerequisite step. For example, the number of values associated with a FORMAT field for a variant record cannot be validated if the field itself is not declared in the HEADER or has a missing Number tag. A summary of all failed checks should be provided as output.

Test files

The following table lists sample files that can be used to test various validation steps. Parent directory is

Table 7: Test files known to pass/fail validation steps

Expected result | Validation | Test file
Success | file with no failures | /vcfformat/tcga _illuminaga-dnaseq_exome.format2.vcf
Failure | "chr" prefix for chromosome names; missing values for FORMAT fields | /BI/ /BCM-GBM_solid.TCGA mut.vcf
Failure | column header line has no "#" at the beginning; trailing whitespace at the end of description strings in metadata; filter not listed in FILTER metadata line; in multi-allelic sites, different alternative alleles in the ALT field are separated by "/" instead of "," | /genome.wustl.edu/ /tcga a-01w vcf.gz
Failure | double quotes missing from Description in SAMPLE metadata lines | /BCM/TCGA _IlluminaGA-DNASeq_exome.vcf
Failure | scores in QUAL column are negative integers | /UCSC/ /TCGA _W_capture.vcf
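The policy described under "Handling failed checks" (run every check, skip only those whose prerequisite did not succeed, and report all failures together) can be sketched in Python as follows. This is a minimal illustration, not the DCC validator: the Check class, the validate function, the three toy checks, the dictionary layout of toy_vcf, and the "Passed" status label are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Check:
    name: str
    run: Callable[[dict], List[str]]  # returns failure messages; empty list means pass
    requires: List[str] = field(default_factory=list)  # names of prerequisite checks


def validate(vcf: dict, checks: List[Check]) -> dict:
    """Run every check, skipping only those whose prerequisites did not succeed."""
    failures: Dict[str, List[str]] = {}
    skipped: List[str] = []
    for check in checks:  # assumes prerequisites are listed before their dependents
        if any(req in failures or req in skipped for req in check.requires):
            skipped.append(check.name)
            continue
        messages = check.run(vcf)
        if messages:
            failures[check.name] = messages
    # "Failed" is the spec's term; "Passed" is a label assumed for this sketch.
    return {"status": "Failed" if failures else "Passed",
            "failures": failures, "skipped": skipped}


def format_declared(vcf: dict) -> List[str]:
    """Every FORMAT sub-field used in the BODY needs a declaration with a Number tag."""
    return [f"FORMAT sub-field {key} has no declaration with a Number tag"
            for key in vcf["format_keys"]
            if "Number" not in vcf["format_decl"].get(key, {})]


def format_value_count(vcf: dict) -> List[str]:
    """Number of values per sub-field must match the declared Number (integer case only)."""
    messages = []
    for key, values in vcf["sample_values"].items():
        expected = vcf["format_decl"][key]["Number"]
        if len(values) != expected:
            messages.append(f"{key} has {len(values)} values, expected {expected}")
    return messages


def qual_valid(vcf: dict) -> List[str]:
    """QUAL may only be a non-negative integer or the missing value '.'."""
    return [f"QUAL value {q} is not a non-negative integer or '.'"
            for q in vcf["qual"] if q != "." and not q.isdigit()]


# Hypothetical, heavily simplified stand-in for a parsed VCF file.
toy_vcf = {
    "format_keys": ["GT", "PL"],
    "format_decl": {"GT": {"Number": 1}},  # PL is used but not declared
    "sample_values": {"GT": ["0/1"], "PL": ["42", "3"]},
    "qual": ["50", "-5", "."],
}
checks = [
    Check("format_declared", format_declared),
    Check("format_value_count", format_value_count, requires=["format_declared"]),
    Check("qual_valid", qual_valid),
]
report = validate(toy_vcf, checks)
print(report["status"])
for name, messages in report["failures"].items():
    print(name, messages)
print("skipped:", report["skipped"])
```

In this toy run the value-count check is skipped because its prerequisite failed, while the unrelated QUAL check still executes, so the summary lists both independent failures.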

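Three of the record-level rules above lend themselves to short illustrations: the FILTER column check, the consistency of the genome mixture tags in SAMPLE lines, and the DP sum rule. The Python sketch below assumes the relevant values have already been parsed out of the file; the function names, argument shapes, and the numeric tolerance used for the Mixture sum are assumptions of this example and are not defined by the specification. Handling of a missing value (".") in the FILTER column is deliberately left out.

```python
import math
from typing import List, Optional, Set


def check_filter(filter_value: str, declared_ids: Set[str]) -> List[str]:
    """FILTER is "PASS" or a ';'-separated list of codes declared in the HEADER."""
    if filter_value == "PASS":
        return []
    return [f"filter code {code} has no FILTER declaration in the HEADER"
            for code in filter_value.split(";")
            if code not in declared_ids]


def check_mixture(genomes: List[str], mixture: List[float]) -> List[str]:
    """Genomes and Mixture need equal counts; proportions lie in [0, 1] and sum to 1."""
    errors = []
    if len(genomes) != len(mixture):
        errors.append(f"Genomes has {len(genomes)} values but Mixture has {len(mixture)}")
    if any(not 0.0 <= m <= 1.0 for m in mixture):
        errors.append("Mixture values must be between 0 and 1")
    # The tolerance is an assumption; the specification only says the values add up to 1.
    if not math.isclose(sum(mixture), 1.0, abs_tol=1e-6):
        errors.append(f"Mixture values add up to {sum(mixture)}, not 1")
    return errors


def check_dp_sum(info_dp: Optional[int], sample_dps: List[Optional[int]]) -> List[str]:
    """If INFO and every sample column carry DP, the sample depths must sum to INFO DP."""
    # Assumption: the rule is not evaluated when any DP value is missing (None).
    if info_dp is None or any(dp is None for dp in sample_dps):
        return []
    total = sum(sample_dps)
    if total != info_dp:
        return [f"sum of FORMAT DP values ({total}) differs from INFO DP ({info_dp})"]
    return []


# The mixture example from the specification: 2 genomes but 3 proportions.
print(check_mixture(["G1", "G2"], [0.1, 0.8, 0.1]))
print(check_filter("q10;s50", {"s50"}))
print(check_dp_sum(30, [10, 15]))
```

Each function returns a list of messages rather than raising, so the results can be collected into a single summary of failed checks.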