Falcon Accelerated Genomics Data Analysis Solutions. User Guide



Falcon Computing Solutions, Inc.
Version 1.0, 3/30/2018

Table of Contents

Introduction
System Requirements and Installation
    Software Prerequisites
    System Setup
    Preparation
Synopsis
    Common Options Among Methods
    fcs-genome align
    fcs-genome markdup
    fcs-genome indel
    fcs-genome bqsr
    fcs-genome baserecal
    fcs-genome printreads
    fcs-genome htc
    fcs-genome ug
    fcs-genome joint
    fcs-genome gatk
Quick Start
    Generating a Marked Duplicates BAM file from Paired-End FASTQ files
    Performing Indel Re-alignment from a Marked Duplicates BAM file
    Performing Base Quality Score Recalibration (BQSR)
    Generating Base Quality Score Recalibration Report (BQSR) from a BAM file with known sites
    Generating Genomic VCF (gvcf) file from a BAM file with Haplotype Caller
Tuning Configurations
    Reference Table for Configurations


Introduction

The Falcon Accelerated Genomics Data Analysis Solutions, comprising the fcs-genome software, supports variant calling for both germline and somatic mutations based on the GATK Best Practices pipelines. Falcon's acceleration technologies significantly improve the performance of these pipelines. As in the GATK Best Practices pipelines, the typical workflow starts with raw paired-end FASTQ reads and ends with a filtered set of variants that can be annotated for further analysis.

The figure below depicts the flow of the germline variant calling pipeline. Beginning with paired-end FASTQ sequence files, the first step is to map the sequences to the reference. The resulting mapped BAM file is sorted, and duplicates are marked. This step, performed with the command fcs-genome align, is equivalent to BWA-MEM, samtools sort and Picard MarkDuplicates in the GATK Best Practices pipeline.

The second step is to recalibrate the base quality scores to account for biases introduced by the sequencing machine. The Falcon pipeline command for this is fcs-genome bqsr. Its GATK equivalent first runs GATK BaseRecalibrator, which produces a recalibration table, followed by GATK PrintReads, which applies that table to produce a new, analysis-ready BAM file.

The final step is germline variant calling with the command fcs-genome htc, which corresponds to GATK HaplotypeCaller.

Figure 1. Side-by-side analysis of the Falcon Accelerated Pipeline and the GATK Best Practices Pipeline. The middle panel indicates the general workflow: (1) mapping the FASTQ sequences to the reference, (2) recalibrating base quality scores, and (3) calling germline variants. The upper and lower panels illustrate the command-line implementation of the workflow using the Falcon Accelerated Pipeline and the GATK Best Practices Pipeline, respectively.

This User Guide provides details on the setup of the Falcon Genome pipeline, command-line usage, and a step-by-step example of running the variant calling pipeline.

System Requirements and Installation

Software Prerequisites

The software package of the Falcon Genomics Solutions is self-contained and includes the software it requires. Please refer to the release notes inside each software distribution for each component and its version. The recommended operating system and required packages are:

    Operating system: CentOS Linux 7.x
    Packages: epel-release, boost, glog, gflags, java

System Setup

The software for fcs-genome is installed in /usr/local/falcon. System information is stored in /usr/local/fcs-genome.conf and can be modified there. Tuning of the configuration parameters is explained in the Tuning Configurations section.

Export fcs-genome and other required tools to the PATH:

    source /usr/bin/falcon/setup.sh

Preparation

Working folder: Paths to the reference genome and the input data are required parameters for the pipeline. Before starting the pipeline, set up a working folder that contains this data and make sure it is readable.

Temporary folder: Most steps in the pipeline produce intermediate files that need to be stored at a temporary location. This location must have free disk space of 3 to 5 times the size of the input files. The location of the temporary folder can be modified in /usr/local/fcs-genome.conf.

Falcon License: A valid license needs to be set up in the environment variable $LM_LICENSE_PATH. If the license file is improperly configured, an error message is reported:

    [fcs-genome] ERROR: Cannot connect to the license server: -15
    [fcs-genome] ERROR: Please contact support@falcon-computing.com for details.

Obtaining the Reference and its index: The reference and its index can be downloaded from the Broad Institute website using the following FTP link:

    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/

To take full advantage of the FPGA acceleration provided by the Falcon Genome image, the reference genome needs to be preprocessed by running the script:

    $FALCON_DIR/prepare-ref.sh <path-to-fasta>

This step is optional; the regular reference genome files (FASTA) still work without preprocessing. The processed reference genome also remains usable with other software such as BWA, Picard and GATK.

Optional arguments: GATK relies on files with known variants in its processing. For example, known variant files including the 1000 Genomes indel sites, the Mills indel sites and the dbSNP sites can be given as additional parameters for the pipeline steps. These can also be downloaded from the Broad Institute website.
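The temporary-folder and license requirements above can be checked before a run. The following is a minimal sketch, not part of the Falcon distribution: the helper names (min_free_kb, input_size_kb, free_kb, has_enough_space) are hypothetical, the factor of 5 is the upper end of the 3 to 5 times guideline, and the file and folder names in the usage comments are placeholders.

```shell
#!/bin/sh
# Preflight sketch: temporary-folder space and license variable.
# All function names here are hypothetical helpers, not fcs-genome features.

# Minimum free space in KB for an input of $1 KB and a safety factor $2.
min_free_kb() {
    echo $(( $1 * $2 ))
}

# Total size in KB of the files given as arguments.
input_size_kb() {
    du -kc "$@" | tail -n 1 | cut -f 1
}

# Free space in KB on the filesystem that holds directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds when directory $1 has at least $2 KB free.
has_enough_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Example usage (placeholder paths):
#   need=$(min_free_kb "$(input_size_kb small_1.fastq.gz small_2.fastq.gz)" 5)
#   has_enough_space /local/temp "$need" || echo "temporary folder too small" >&2
#   [ -n "${LM_LICENSE_PATH:-}" ] || echo "LM_LICENSE_PATH is not set" >&2
```

The check only compares sizes at the moment it runs; it does not reserve space, so other jobs writing to the same temporary folder can still exhaust it.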

Synopsis

This section lists all the methods available in the fcs-genome command with their respective option settings.

    fcs-genome align -r ref.fasta -1 input_1.fastq -2 input_2.fastq \
        -o aln.sorted.bam --rg RG_ID --sp sample_id \
        --pl platform --lb library
    fcs-genome markdup -i aln.sorted.bam -o aln.marked.bam
    fcs-genome indel -r ref.fasta -i aln.sorted.bam -o indel.bam
    fcs-genome bqsr -r ref.fasta -i indel.bam -o recal.bam
    fcs-genome baserecal -r ref.fasta -i indel.bam -o recalibration_report.grp
    fcs-genome printreads -r ref.fasta -b recalibration_report.grp -i indel.bam \
        -o recal.bam
    fcs-genome htc -r ref.fasta -i recal.bam -o final.gvcf
    fcs-genome joint -r ref.fasta -i final.gvcf -o final.vcf
    fcs-genome ug -r ref.fasta -i recal.bam -o final.vcf
    fcs-genome gatk -T analysistype

For additional parameters, type fcs-genome [method] on the command line. Each method accepts the original GATK options through the option --extra-options followed by the GATK option in quotes. Example:

    fcs-genome printreads -r ref.fasta -b recalibration_report.grp -i indel.bam \
        -o recal.bam --extra-options "-n "

Please check the GATK documentation for all extra options available. The tables below show all options available in each method.

(*):

Common Options Among Methods

    -h  --help                      print help messages
    -f  --force                     overwrite output files if they exist
    -O  --extra-options  String(*)  extra options in GATK for the command. Use " " to enclose the GATK command. Example: "--TheOption arg"

fcs-genome align

Performs alignment using the Burrows-Wheeler algorithm. It is the equivalent of bwa-mem. By default, duplicates are marked. If --align-only is set, duplicates are not marked.

    -1  --fastq1      String(*)  input paired-end Read 1 FASTQ file
    -2  --fastq2      String(*)  input paired-end Read 2 FASTQ file
    -o  --output      String(*)  output BAM file (if --align-only is set, the output is a directory of BAM files)
    -R  --rg          String(*)  read group id ('ID' in BAM header)
    -S  --sp          String(*)  sample id ('SM' in BAM header)
    -P  --pl          String(*)  platform id ('PL' in BAM header)
    -L  --lb          String(*)  library id ('LB' in BAM header)
    -l  --align-only             skip mark duplicates

fcs-genome markdup

Takes a BAM file and marks duplicate reads.

    -i  --input   String(*)  input BAM file
    -o  --output  String(*)  output BAM file

fcs-genome indel

Takes a BAM file and performs indel re-alignment.

    -i  --input   String(*)  input BAM file
    -o  --output  String(*)  output BAM file
    -K  --known   String     known indels for realignment (VCF format). For more than one VCF file, add -K for each file

fcs-genome bqsr

Takes a BAM file and performs Base Quality Score Recalibration. It can be performed within a region defined in the --knownsites option. If --bqsr is set, a report is generated.

    -b  --bqsr        String(*)  output BQSR file (if left blank, no file is produced)
    -i  --input       String(*)  input BAM file
    -o  --output      String(*)  output BAM file
    -K  --knownsites  String(*)  known sites (VCF format). For more than one VCF file, add -K for each file

fcs-genome baserecal

Takes a BAM file and generates a Base Quality Score Recalibration report.

    -i  --input       String(*)  input BAM file
    -o  --output      String(*)  output BQSR file
    -K  --knownsites  String(*)  known sites (VCF format). For more than one VCF file, add -K for each file

fcs-genome printreads

Takes a BAM file and filters reads according to settings defined in --extra-options.

    -b  --bqsr    String(*)  input BQSR file
    -i  --input   String(*)  input BAM file or directory
    -o  --output  String(*)  output BAM files

fcs-genome htc

Takes a BAM file and generates a gvcf file by default. If --produce-vcf is set, a VCF file is generated instead of a gvcf file.

    -i  --input        String(*)  input BAM file or directory
    -o  --output       String(*)  output gvcf/vcf file (if --skip-concat is set, the output is a directory of gvcf files)
    -v  --produce-vcf             produce VCF files from HaplotypeCaller instead of gvcf
    -s  --skip-concat             (deprecated) produce a set of gvcf/vcf files instead of one

fcs-genome ug

This method is the equivalent of UnifiedGenotyper in GATK. It takes a BAM file as input and generates a VCF file. It accepts GATK options through --extra-options.

    -i  --input        String(*)  input BAM file or directory
    -o  --output       String(*)  output a compressed VCF file
    -s  --skip-concat             produce a set of VCF files instead of one

fcs-genome joint

This method performs joint variant calling from a set of VCF files.

    -i  --input-dir     String(*)  input directory containing compressed gvcf files
    -o  --output        String(*)  output compressed gvcf files
    -c  --combine-only             combine gvcfs only and skip genotyping
    -g  --skip-combine             (deprecated) perform genotype gvcfs only and skip combine gvcf

fcs-genome gatk

This method emulates the original GATK command. Please refer to the GATK documentation for additional details.
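Several methods in this Synopsis take one -K option per known-sites VCF file. The sketch below shows one way to assemble those repeated options from a list of files and print the resulting command for inspection. The helper name known_args is hypothetical, the VCF file names are placeholders, and the approach assumes paths without spaces.

```shell
#!/bin/sh
# Build one "-K <file>" pair per VCF given as an argument.
# known_args is a hypothetical helper, not an fcs-genome feature.
known_args() {
    out=""
    for vcf in "$@"; do
        # Prefix with a separating space only when out is already non-empty.
        out="${out:+$out }-K $vcf"
    done
    printf '%s\n' "$out"
}

# Print the command instead of running it, so it can be reviewed first.
sites=$(known_args indels.vcf dbsnp.vcf)
echo "fcs-genome baserecal -r ref.fasta -i indel.bam -o recalibration_report.grp $sites"
```

Printing the command first is a cheap way to confirm that every known-sites file received its own -K before a long run is started.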

Quick Start

The examples below were written as BASH scripts and quickly tested on a 16-core AWS instance (Intel(R) Xeon(R) CPU E GHz, 2 threads per core). Each example can be saved in a file and submitted to the server as follows:

    chmod a+x myscript.sh ; nohup ./myscript.sh &

For illustration purposes, the FASTQ files (small_1.fastq.gz and small_2.fastq.gz) used in the examples contain 10K paired-end reads. In FASTQ format, each DNA read consists of 4 lines. Therefore, to get 10,000 DNA reads, 40,000 lines need to be extracted from the original FASTQ file. The small files can be generated from any paired-end FASTQ files using the following Linux commands:

    zcat originalfastq_r1.fastq.gz | head -n 40000 > small_1.fastq ; gzip small_1.fastq
    zcat originalfastq_r2.fastq.gz | head -n 40000 > small_2.fastq ; gzip small_2.fastq

For a more exhaustive test, the platinum pedigree samples (NA12878, NA12891 and NA12892) can be downloaded and used as examples. Alternatively, Illumina BaseSpace (account required) provides public data sequenced with the most recent technology.

Generating a Marked Duplicates BAM file from Paired-End FASTQ files

fcs-genome align performs alignment to the reference, sorts, marks duplicates, and saves the mapped reads in a BAM file. If --align-only is set, duplicates are not marked. The BASH script below illustrates the usage of the align method:

    SAMPLE_ID="small"
    R1="${SAMPLE_ID}_1.fastq.gz"
    R2="${SAMPLE_ID}_2.fastq.gz"
    REF="/local/ref/human_g1k_v37.fasta"
    BAMFILE="${SAMPLE_ID}_marked_sorted.bam"
    RG_ID="H0BA0ADXX"
    PLATFORM="Illumina"
    LIB="RD001"

    fcs-genome align \
        -r $REF -1 $R1 -2 $R2 \
        -o ${BAMFILE} \
        --rg $RG_ID --sp ${SAMPLE_ID} \
        --pl ${PLATFORM} --lb ${LIB}

For the 10K paired-end reads contained in the FASTQ files, alignment took 13 seconds and marking duplicates took 1 second on the AWS server. The BAM file is generated together with its index.
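The four-lines-per-read arithmetic used to build the small FASTQ files in this section can be wrapped in two helpers and checked on synthetic data. This is a sketch only: subsample_fastq and count_reads are hypothetical helper names, and the 25-read input file is generated on the spot so the example runs without real sequencing data.

```shell
#!/bin/sh
# Take the first N reads of a gzipped FASTQ file (4 lines per read).
# Usage: subsample_fastq in.fastq.gz N out.fastq.gz
subsample_fastq() {
    gzip -dc "$1" | head -n $(( $2 * 4 )) | gzip -c > "$3"
}

# Count the reads in a gzipped FASTQ file.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Demo: build a synthetic 25-read FASTQ, keep 10 reads, and count them.
demo_dir=$(mktemp -d)
i=1
while [ "$i" -le 25 ]; do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
    i=$(( i + 1 ))
done | gzip -c > "$demo_dir/original_r1.fastq.gz"

subsample_fastq "$demo_dir/original_r1.fastq.gz" 10 "$demo_dir/small_1.fastq.gz"
count_reads "$demo_dir/small_1.fastq.gz"   # prints 10
```

For paired-end data the same N must be applied to both the R1 and R2 files so the read pairs stay in step.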

Performing Indel Re-alignment from a Marked Duplicates BAM file

Once the alignment is completed, indel re-alignment is performed. The BASH script below demonstrates the usage of the indel method:

    REF="/local/ref/human_g1k_v37.fasta"
    SAMPLE_ID="small"
    BAM_INPUT="${SAMPLE_ID}_marked_sorted.bam"
    BAM_OUTPUT="${SAMPLE_ID}_marked_sorted_indel_realign"

    fcs-genome indel \
        -r $REF \
        -i ${BAM_INPUT} \
        -o ${BAM_OUTPUT}

A folder called ${SAMPLE_ID}_marked_sorted_indel_realign/ is created with a set of BAM and bai files with indels re-aligned. This step takes 100 seconds on the AWS server.

Performing Base Quality Score Recalibration (BQSR) from BAM file with pre-defined known sites

fcs-genome bqsr performs GATK's Base Quality Score Recalibration and Print Reads in a single command. Per-base quality scores produced by the sequencing machine are checked for errors and corrected. The recalibrated reads are written into a folder that contains a set of BAM files. During the process, a recalibration report is generated. The script below illustrates the usage of the bqsr method:

    REF="/local/ref/human_g1k_v37.fasta"
    ThousandGen="/local/ref/1000G_phase1.indels.b37.vcf"
    Mills="/local/ref/Mills_and_1000G_gold_standard.indels.b37.vcf"
    SNP="/local/ref/dbsnp_138.b37.vcf"
    SAMPLE_ID="small"
    BAM_INPUT="${SAMPLE_ID}_marked_sorted_indel_realign"
    BAM_OUTPUT="${SAMPLE_ID}_recalibrated"

    fcs-genome bqsr \
        -r $REF \
        -i ${BAM_INPUT} -o ${BAM_OUTPUT} \
        -b recalibration_report.grp \
        -K $ThousandGen -K $Mills -K $SNP

For this example, it took around 1203 seconds on the AWS server to complete.

Generating Base Quality Score Recalibration Report (BQSR) from a BAM file with known sites

In this example, the BQSR analysis was performed using as input a folder that contained BAM files and their respective bai files. A base recalibration report is generated.
    REF="/local/ref/human_g1k_v37.fasta"
    SAMPLE_ID="small"
    BAM_INPUT="${SAMPLE_ID}_marked_sorted_indel_realign"
    ThousandGen="/local/ref/1000G_phase1.indels.b37.vcf"
    Mills="/local/ref/Mills_and_1000G_gold_standard.indels.b37.vcf"
    SNP="/local/ref/dbsnp_138.b37.vcf"

    fcs-genome baserecal \
        -r $REF \
        -i ${BAM_INPUT} -o recalibration_report.grp \
        -K $ThousandGen -K $Mills -K $SNP

The command also works with a single BAM file. It takes around 1177 seconds to complete.

Generating Genomic VCF (gvcf) file from a BAM file with Haplotype Caller

fcs-genome htc performs germline variant calling on the input BAM file, with gvcf as the default output format. If --produce-vcf is set, a VCF file is produced.

    SAMPLE_ID="small"
    REF="/local/ref/human_g1k_v37.fasta"
    BAM_INPUT="${SAMPLE_ID}_recalibrated.bam"
    OutputVCF="${SAMPLE_ID}_final.gvcf"

    fcs-genome htc \
        -r ${REF} \
        -i ${BAM_INPUT} \
        -o ${OutputVCF}

For this example, it takes 415 seconds to complete on the AWS server. The htc method accepts multiple BAM files as input.
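The Quick Start steps can be chained in a single script. The sketch below strings together align, indel, bqsr and htc using only options that appear in this guide, and by default prints each command rather than running it. The run and pipeline helpers and the DRY_RUN variable are hypothetical conveniences, not part of fcs-genome. Passing the bqsr output folder directly to htc is an assumption, based on the htc option table listing "input BAM file or directory".

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints commands; DRY_RUN=0 executes them.
DRY_RUN=${DRY_RUN:-1}

# run is a hypothetical wrapper: echo the command or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# pipeline SAMPLE_ID REFERENCE: align -> indel -> bqsr -> htc.
# The && chain stops at the first step that fails.
pipeline() {
    sample=$1
    ref=$2
    kdir=/local/ref   # location of known-sites VCFs, as in the examples above
    run fcs-genome align -r "$ref" \
        -1 "${sample}_1.fastq.gz" -2 "${sample}_2.fastq.gz" \
        -o "${sample}_marked_sorted.bam" \
        --rg H0BA0ADXX --sp "$sample" --pl Illumina --lb RD001 &&
    run fcs-genome indel -r "$ref" \
        -i "${sample}_marked_sorted.bam" \
        -o "${sample}_marked_sorted_indel_realign" &&
    run fcs-genome bqsr -r "$ref" \
        -i "${sample}_marked_sorted_indel_realign" \
        -o "${sample}_recalibrated" \
        -b recalibration_report.grp \
        -K "$kdir/1000G_phase1.indels.b37.vcf" \
        -K "$kdir/Mills_and_1000G_gold_standard.indels.b37.vcf" \
        -K "$kdir/dbsnp_138.b37.vcf" &&
    run fcs-genome htc -r "$ref" \
        -i "${sample}_recalibrated" \
        -o "${sample}_final.gvcf"
}

pipeline small /local/ref/human_g1k_v37.fasta
```

Reviewing the printed commands first and then rerunning with DRY_RUN=0 keeps path mistakes from surfacing only after the long-running steps have started.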

Tuning Configurations

Configurations can be tuned to define the settings for each command-line option during the run. The default configuration settings are stored in /usr/local/fcs-genome.conf. If a file with the same name, fcs-genome.conf, is present in the current directory, its values override the default values. In addition, environment variables can be used to override both the default configurations and the configurations in fcs-genome.conf in the current directory.

An example of the configuration settings for the germline variant calling pipeline is shown below:

    temp_dir = /local/temp
    gatk.ncontigs = 32
    gatk.nprocs = 16
    gatk.nct = 1
    gatk.memory = 8

The key temp_dir specifies the system folder used to store temporary files. Some steps in fcs-genome, including align, write large files to a temporary folder. Please ensure this configuration is set to a location with enough space. The recommended free space is 3 to 5 times the input data size.

The GATK steps, such as BaseRecalibrator, PrintReads and HaplotypeCaller, are run in parallel. By default, 32 total processes are used for each GATK step. To change the default number, set the key gatk.ncontigs. The configuration key gatk.nprocs specifies the number of concurrent processes in each step, and gatk.memory specifies the memory consumed by each process. Ideally, gatk.nprocs should be less than or equal to the total number of CPU cores, and the product of gatk.nprocs and gatk.memory should be less than or equal to the total memory.

The number of concurrent processes and the memory per process can be changed for individual steps with the following format: [step-name].nprocs, [step-name].memory

Reference Table for Configurations

Configuration key           Type    Default  Description
bwa.verbose                         0        verbose level of bwa output
bwa.nt                              -1       number of threads for bwa; by default all available threads in the system are used
bwa.num_batches_per_part            20       max num records in each BAM file
bwa.use_fpga                bool    true     option to enable FPGA for bwa-mem
bwa.use_sort                bool    true     enable sorting in bwa-mem
bwa.enforce_order           bool    true     enforce strict sorting ordering
bwa.fpga.bit_path           string           path to FPGA bitstream for bwa
bwa.scaleout_mode           bool             enable scale-out mode for bwa
markdup.max_files                   4096     max opened files in markdup
markdup.nt                          16       thread num in markdup
markdup.overflow-listsize                    overflow list size in markdup
gatk.scaleout_mode          bool             enable scale-out mode for gatk
gatk.intv.path              string           default path to existing contig intervals
gatk.ncontigs                       32       default contig partition num in GATK steps
gatk.nprocs                                  default process num in all GATK steps; set to the CPU count or gatk.ncontigs, whichever is lower
gatk.nct                            1        default thread num in GATK steps
gatk.memory                 Int     8        default heap memory in GATK steps
gatk.skip_pseudo_chr        bool             skip pseudo chromosome intervals
gatk.bqsr.nprocs                             default process num in GATK BaseRecalibrator
gatk.bqsr.nct                                default thread num in GATK BaseRecalibrator
gatk.bqsr.memory                             default heap memory in GATK BaseRecalibrator
gatk.pr.nprocs                               default process num in GATK PrintReads
gatk.pr.nct                                  default thread num in GATK PrintReads
gatk.pr.memory                               default heap memory in GATK PrintReads
gatk.htc.nprocs                              default process num in GATK HaplotypeCaller
gatk.htc.nct                                 default thread num in GATK HaplotypeCaller
gatk.htc.memory                              default heap memory in GATK HaplotypeCaller
gatk.indel.nprocs                            default process num in GATK IndelRealigner
gatk.indel.memory                            default heap memory in GATK IndelRealigner
gatk.ug.nprocs                               default process num in GATK UnifiedGenotyper
gatk.ug.nt                                   default thread num in GATK UnifiedGenotyper
gatk.ug.memory                               default heap memory in GATK UnifiedGenotyper
gatk.rtc.nt                         16       default thread num in GATK RealignerTargetCreator
gatk.rtc.memory                     48       default heap memory in GATK RealignerTargetCreator
gatk.joint.ncontigs                 32       default contig partition num in joint genotyping
gatk.combine.nprocs                 16       default process num in GATK CombineGVCFs
gatk.genotype.nprocs        Int     32       default process num in GATK GenotypeGVCFs
gatk.genotype.memory                4        default heap memory in GATK GenotypeGVCFs
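The sizing guideline in this section (gatk.nprocs no greater than the number of CPU cores, and gatk.nprocs times gatk.memory no greater than total memory) can be checked against a configuration file before a run. The sketch below is illustrative only: conf_get and fits_budget are hypothetical helpers, the configuration file is a temporary copy of the example settings above, and the host figures (16 cores, 128 memory units) are made up. It assumes gatk.memory and the host memory are expressed in the same unit, which this guide does not state.

```shell
#!/bin/sh
# conf_get FILE KEY: print the value of a "key = value" line in FILE.
conf_get() {
    awk -F '=' -v key="$2" '
        {
            k = $1; v = $2
            gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim spaces around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim spaces around the value
        }
        k == key { print v; exit }
    ' "$1"
}

# fits_budget NPROCS MEM_PER_PROC CORES TOTAL_MEM
# Succeeds when NPROCS <= CORES and NPROCS * MEM_PER_PROC <= TOTAL_MEM.
fits_budget() {
    [ "$1" -le "$3" ] && [ $(( $1 * $2 )) -le "$4" ]
}

# Temporary copy of the example configuration from this section.
conf=$(mktemp)
cat > "$conf" <<'EOF'
temp_dir = /local/temp
gatk.ncontigs = 32
gatk.nprocs = 16
gatk.nct = 1
gatk.memory = 8
EOF

nprocs=$(conf_get "$conf" gatk.nprocs)
memory=$(conf_get "$conf" gatk.memory)

# 16 cores and 128 memory units are made-up host figures for this demo.
if fits_budget "$nprocs" "$memory" 16 128; then
    echo "gatk.nprocs=$nprocs x gatk.memory=$memory fits the host"
else
    echo "gatk.nprocs=$nprocs x gatk.memory=$memory exceeds the host" >&2
fi
```

The same two helpers can be pointed at a fcs-genome.conf in the current directory to check per-step overrides such as gatk.htc.nprocs and gatk.htc.memory.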
