elprep: a high-performance tool for preparing SAM/BAM files for variant calling
Charlotte Herzeel (Imec), Pascal Costanza (Intel)
July 2014
2 Overview
- Sequencing pipelines in practice
  - From raw reads to mapped reads to analysis-ready reads
  - Many different tools in use
  - Communication via files using standardized formats (SAM/BAM)
  - Repeated file I/O for each step, little multi-threading in existing tools
- elprep: high-performance preparation of SAM/BAM
  - Multi-threaded, executes entirely in memory
  - Expresses pipeline as a set of filters on a stream of read data
  - Avoids file I/O, merges computations of several pipeline steps
  - Open-source, modular, extensible implementation
- Benchmarks
3 Sequencing pipelines in practice
- Goal: from raw reads to analysis
- Many competing pipelines and tools in use
- Communication via standardized file formats
[Figure: Raw Reads -> Mapped Reads -> Analysis Ready Reads -> Analyzed Reads]
4 Sequencing pipelines in practice
Example: GATK best-practice pipeline (Broad Institute):

Step                                 Tool        File I/O
1. Alignment                         bwa         in: .fastq, out: .sai
2. SAI to SAM                        bwa samse   in: .sai, out: .sam
3. SAM to BAM                        samtools    in: .sam, out: .bam
4. Removing unmapped reads           samtools    in: .bam, out: .bam
5. Marking duplicate reads           picard      in: .bam, out: .bam
6. Adding RG information             picard      in: .bam, out: .bam
7. Sorting for contigs               picard      in: .bam, out: .bam
8. Sorting for coordinate order      samtools    in: .bam, out: .bam
9. Indel realignment                 gatk        in: .bam, out: .bam
10. Base recalibration               gatk        in: .bam, out: .bam
11. RR compression                   gatk        in: .bam, out: .bam
12. Variant calling                  gatk        in: .bam, out: .vcf
5 Sequencing pipelines in practice
Example: GATK best-practice pipeline (Broad Institute):
- Alternating use of C and Java programs
- File I/O, including compression/decompression, at each step
- Repeated iterations over the same data set
Better performing solution:
- Keep data in memory throughout the entire pipeline
- Merge loops where possible by implementing pipeline steps as filters that can be composed
- Make use of big RAM servers
We implemented this approach for the preparation phase as a new tool called elprep
6 Pipeline steps as filters
- Most steps in the pipeline operate on BAM files: SAM/BAM is a standard format for representing mapped reads.
- The pipeline steps can be seen as filters that update or add fields to the BAM representation of a read, or remove reads altogether.
7 Pipeline steps as filters
Removing unmapped reads
- Check if FLAG has bit 0x4 set (see 2nd entry)

ERR * 0 0 * * 0 0 NCATTCCATTTCATTCCACTAGGGTTCATTCCATTCCGTTCCATTCCATTCCACTCNTGTTGAT TCCATTCCGTNCCTTTCCTTTTCATTCCATTCAATT #1BDFFFFHHGHHJJJJJJJJJJJFGHJIJJJJJJJJJIEIJJJJJJIJJJJJJJI#0? FHIIJJJJJJJJDHI#- 5@EHIEHGHHEDFFEDDDDE@CEE
ERR chr M * 0 0 GCTTCTCCTGAGATCATCGTTCCTGNTCCTGGAACACTTTTCCNCCCTAAATTTTACTTTTTAA ATCTTTCTTATTGTTTTTGTTTGCCTTCTGTTGCTN DDDDDDDDDDEDEDDDDDDDDCA;,#FFHHFGHHEHHGIGCB- #JJJJJJJJJIGFJJIJJJJJJIJJJJJJJJJJJJJJJJJIJJJHHHHHFFFDD=1# XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:25C17A55T0
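The FLAG check described on this slide can be sketched in a few lines of Python. This is only an illustration, not elprep's code: the function name `is_mapped` and the two sample records (names, flags, sequences) are invented here, and the records are assumed to be TAB-delimited SAM lines with FLAG as the second field.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"

def is_mapped(sam_line: str) -> bool:
    """Return True when the alignment should be kept, i.e. bit 0x4 is not set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second mandatory SAM field
    return (flag & UNMAPPED) == 0

# Hypothetical records, made up for illustration only.
unmapped = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
mapped = "read2\t16\tchrM\t100\t37\t4M\t*\t0\t0\tACGT\tIIII"

# Filtering keeps only the mapped record.
kept = [line for line in (unmapped, mapped) if is_mapped(line)]
```

Because the test is a single bitwise AND on one field, it is cheap enough to run as one of several filters inside the same pass over the reads.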
9 Pipeline steps as filters
Adding read group information
- Add optional tag to each read entry + add header tag

ERR chr M * 0 0 GCTTCTCCTGAGATCATCGTTCCTGNTCCTGGAACACTTTTCCNCCCTAAATTTTACTTTTTAA ATCTTTCTTATTGTTTTTGTTTGCCTTCTGTTGCTN DDDDDDDDDDEDEDDDDDDDDCA;,#FFHHFGHHEHHGIGCB- #JJJJJJJJJIGFJJIJJJJJJIJJJJJJJJJJJJJJJJJIJJJHHHHHFFFDD=1# XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:25C17A55T0
10 Pipeline steps as filters
Adding read group information
- Add optional tag to each read entry + add header

ID:group1 PL:illumina PU:unit1 LB:lib1 SM:sample1
ERR chr M * 0 0 GCTTCTCCTGAGATCATCGTTCCTGNTCCTGGAACACTTTTCCNCCCTAAATTTTACTTTTTAA ATCTTTCTTATTGTTTTTGTTTGCCTTCTGTTGCTN DDDDDDDDDDEDEDDDDDDDDCA;,#FFHHFGHHEHHGIGCB- #JJJJJJJJJIGFJJIJJJJJJIJJJJJJJJJJJJJJJJJIJJJHHHHHFFFDD=1# XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:25C17A55T0 RG:Z:group1
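A rough Python sketch of this step, under stated assumptions: the header and alignments are plain TAB-delimited SAM lines, the read group is written as an `@RG` header line, and any existing `RG:Z:` tag on a read is replaced. The function name `add_read_group` and the sample records are hypothetical; the read group values are the ones shown on the slide.

```python
def add_read_group(header_lines, alignment_lines, rg):
    """Return (header, alignments) with an @RG header line and an RG:Z tag added.

    rg is a dict such as {"ID": "group1", "PL": "illumina", ...}.
    """
    rg_line = "@RG\t" + "\t".join(f"{key}:{value}" for key, value in rg.items())
    new_header = list(header_lines) + [rg_line]
    tag = f"RG:Z:{rg['ID']}"
    new_alignments = []
    for line in alignment_lines:
        fields = line.split("\t")
        # Fields 0..10 are mandatory; optional tags start at index 11.
        # Assumption: an existing RG tag is dropped before the new one is added.
        optional = [f for f in fields[11:] if not f.startswith("RG:Z:")]
        new_alignments.append("\t".join(fields[:11] + optional + [tag]))
    return new_header, new_alignments

# Read group values from the slide; the records themselves are made up.
read_group = {"ID": "group1", "PL": "illumina", "PU": "unit1",
              "LB": "lib1", "SM": "sample1"}
header = ["@HD\tVN:1.4"]
alignments = ["read2\t16\tchrM\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:3"]
new_header, new_alignments = add_read_group(header, alignments, read_group)
```

The per-read work only touches the optional fields, so it composes naturally with other per-read filters.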
11 Pipeline steps as filters: Marking duplicates
PCR duplicates:
- When multiple reads of the same DNA molecule occur
- Inherent to the wet lab process of sequencing
- Hard to identify because PCR duplicates do not necessarily produce the same sequence reads
Marking duplicates in Picard:
- We derive the algorithmic structure from the source code (1117 loc Java)
12 Marking duplicates (Picard)
Phase 1:
- sort reads according to mapping coordinates while keeping track of the original order
- spilled to disk when running out of RAM (manually)

    pos = 0
    sorted-list = []
    loop for line in input-file
      read = new read(parse(line), pos)
      sorted-list.insert-sorted(read, #coordinate-order)
      pos++
13 Marking duplicates (Picard)
Phase 2:
- identify groups of potential duplicates within the sorted list
- for those groups of reads, track the file indexes for all but the read with the highest Phred score

    pos-reads-to-mark = []
    duplicates-list = [sorted-list.pop()]
    loop for read in sorted-list
      dup-read = duplicates-list.peek()
      if read.coordinate() == dup-read.coordinate()
        duplicates-list.push(read)
      else
        best-read = duplicates-list.peek()
        loop for dup-read in duplicates-list
          if best-read.score() < dup-read.score()
            best-read = dup-read
        loop for dup-read in duplicates-list
          if dup-read != best-read
            pos-reads-to-mark.append(dup-read.pos())
        duplicates-list = [read]
14 Marking duplicates (Picard)
Phase 3:
- write a new file by copying the reads from the original file, using the file position to identify the reads to tag as duplicate

    pos = 0
    read-to-mark = pos-reads-to-mark.pop()
    loop for line in input-file
      read = new read(parse(line))
      if pos == read-to-mark
        write read.mark() to output-file
        read-to-mark = pos-reads-to-mark.pop()
      else
        write read to output-file
      pos++
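The three phases can be condensed into a small Python sketch. It follows the structure of the slides' pseudocode, not Picard's actual source. Its simplifications are assumptions made here: a read is reduced to a `(coordinate, score)` tuple, everything stays in memory (no spilling to disk), and ties in score keep the read that appears first in the file. The function name is hypothetical.

```python
from itertools import groupby

def mark_duplicates_three_pass(reads):
    """reads: list of (coordinate, score) tuples in file order.

    Returns a list of (coordinate, score, is_duplicate) in the same order.
    """
    # Phase 1: sort by coordinate while remembering each read's file position.
    indexed = sorted(enumerate(reads), key=lambda item: item[1][0])

    # Phase 2: per group of equal coordinates, collect the positions of all
    # reads except the one with the highest score.
    positions_to_mark = set()
    for _, group in groupby(indexed, key=lambda item: item[1][0]):
        group = list(group)
        best_position = max(group, key=lambda item: item[1][1])[0]
        positions_to_mark.update(pos for pos, _ in group if pos != best_position)

    # Phase 3: second pass over the input, marking by file position.
    return [(coord, score, pos in positions_to_mark)
            for pos, (coord, score) in enumerate(reads)]

# Made-up input: three reads at coordinate 100, one at coordinate 200.
sample = [(100, 30), (200, 10), (100, 40), (100, 20)]
result = mark_duplicates_three_pass(sample)
# result == [(100, 30, True), (200, 10, False), (100, 40, False), (100, 20, True)]
```

Even in this compact form the input is traversed twice and sorted once, which is the overhead the next slides set out to remove.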
15 Marking duplicates (Picard)
Picard algorithm:
- Multiple loops over the same data
- The .BAM file is opened and parsed into objects twice:
  - Once for identifying the positions of the duplicate reads in the file
  - Secondly for updating those reads
- Intermediate files for spilling sorted reads to disk manually
We can improve on this
16 Marking duplicates (Picard)
Conceptually, this is what marking duplicates does:
- Two reads are considered duplicates if they map to the same coordinate in the reference
- Mark the one with the lesser Phred score as duplicate
17 Marking duplicates as a filter
Algorithmic redesign in elprep:
- Keep a table of reads checked for duplicate marking so far, hashed on mapping coordinate
- Only one pass through the reads necessary

    cache = []
    loop for read in reads
      cached-read = cache.hash(read, #coordinate)
      if read.score() > cached-read.score()
        cached-read.mark()
        cache.remove(cached-read)
        cache.add(read)
      else
        read.mark()
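A minimal Python sketch of the single-pass idea, assuming reads are dictionaries with `coordinate`, `score` and `duplicate` keys (an invented representation) and that all reads stay in memory, so a read seen earlier can still be marked later. Unlike the pseudocode, the sketch spells out the case where no read has been cached yet for a coordinate.

```python
def mark_duplicates_single_pass(reads):
    """Mark duplicates in place in one pass over the reads.

    Each read is a dict with 'coordinate', 'score' and 'duplicate' keys.
    """
    best_by_coordinate = {}  # hash table: coordinate -> best read seen so far
    for read in reads:
        cached = best_by_coordinate.get(read["coordinate"])
        if cached is None:
            # First read at this coordinate: nothing to compare against yet.
            best_by_coordinate[read["coordinate"]] = read
        elif read["score"] > cached["score"]:
            # The new read wins: the previously cached read becomes the duplicate.
            cached["duplicate"] = True
            best_by_coordinate[read["coordinate"]] = read
        else:
            read["duplicate"] = True
    return reads

# Made-up input: three reads at coordinate 100, one at coordinate 200.
sample = [
    {"coordinate": 100, "score": 30, "duplicate": False},
    {"coordinate": 200, "score": 10, "duplicate": False},
    {"coordinate": 100, "score": 40, "duplicate": False},
    {"coordinate": 100, "score": 20, "duplicate": False},
]
mark_duplicates_single_pass(sample)
flags = [read["duplicate"] for read in sample]
# flags == [True, False, False, True]
```

No sorting and no second pass are needed; the cost is one hash lookup per read.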
18 Composing the filters
Define pipeline as a higher-order function:
- One loop to go through read data
- Pass filters to apply as functions

    filters = [#'remove-unmapped, #'mark-duplicates, ...]
    loop for read in reads
      loop for func in filters
        apply func read
19 Composing the filters
Define pipeline as a higher-order function:
- One loop to go through read data
- Pass filters to apply as functions
- Persisting to file

    filters = [#'remove-unmapped, #'mark-duplicates, ...]
    loop for read in reads
      loop for func in filters
        if not(apply func read)
          return
      write read to file
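The higher-order pipeline can be sketched in Python as follows. The names `run_pipeline`, `remove_unmapped` and `add_read_group`, the dictionary representation of a read, and the `write` callback standing in for file output are all assumptions made for this illustration.

```python
def run_pipeline(reads, filters, write):
    """Apply every filter to each read in a single loop over the data.

    A filter may modify the read and returns False to drop it. `all` stops at
    the first False, so later filters never see a dropped read.
    """
    for read in reads:
        if all(apply_filter(read) for apply_filter in filters):
            write(read)

def remove_unmapped(read):
    return (read["flag"] & 0x4) == 0

def add_read_group(read):
    read["tags"]["RG"] = "group1"
    return True  # this filter never drops a read

# Made-up reads; `written` stands in for the output file.
reads = [
    {"name": "r1", "flag": 4, "tags": {}},
    {"name": "r2", "flag": 0, "tags": {}},
]
written = []
run_pipeline(reads, [remove_unmapped, add_read_group], written.append)
```

Adding a pipeline step means appending one function to the list; the loop over the reads stays the same.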
20 Domain-specific filter architecture
- A filter receives a SAM header. It returns a thread-local filter.
- A thread-local filter returns an alignment filter.
- An alignment filter receives a SAM alignment. It can modify the alignment. It returns a boolean indicating whether to keep the alignment or not.
- Alignment filters are executed in parallel
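The three nesting levels can be illustrated with Python closures. This is a structural sketch only and not elprep's API: the header and alignments are modelled as plain dictionaries, all function names are invented, and Python threads are used merely to show where the thread-local level fits (they do not give the parallel speed-up a native implementation would).

```python
from concurrent.futures import ThreadPoolExecutor

def remove_unmapped_filter(header):
    """Filter level: sees the SAM header once, returns a thread-local filter."""
    def thread_local_filter():
        """Thread-local level: called once per worker, returns an alignment filter."""
        def alignment_filter(alignment):
            return (alignment["flag"] & 0x4) == 0  # keep mapped alignments only
        return alignment_filter
    return thread_local_filter

def add_read_group_filter(rg_id):
    def header_filter(header):
        header.setdefault("RG", []).append({"ID": rg_id})  # header is updated once
        def thread_local_filter():
            def alignment_filter(alignment):
                alignment["tags"]["RG"] = rg_id
                return True
            return alignment_filter
        return thread_local_filter
    return header_filter

def run(header, alignments, filters, workers=2):
    thread_local_filters = [f(header) for f in filters]

    def process(chunk):
        # Each worker builds its own alignment filters.
        alignment_filters = [make() for make in thread_local_filters]
        return [a for a in chunk if all(af(a) for af in alignment_filters)]

    size = max(1, -(-len(alignments) // workers))  # ceiling division
    chunks = [alignments[i:i + size] for i in range(0, len(alignments), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process, chunks))  # map preserves chunk order
    return [a for chunk in results for a in chunk]

# Made-up data for illustration.
header = {"HD": {"VN": "1.4"}}
alignments = [{"name": n, "flag": f, "tags": {}}
              for n, f in [("a", 0), ("b", 4), ("c", 16), ("d", 4), ("e", 0)]]
kept = run(header, alignments,
           [remove_unmapped_filter, add_read_group_filter("group1")])
```

Splitting the levels this way lets work that depends only on the header happen once, per-thread state be set up once per worker, and only the innermost function run per alignment.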
21 Pipeline execution without elprep
- Only some existing tools are parallelized
- Most steps are executed sequentially, with repeated file I/O
22 Hypothetical execution with parallelized tools
- Even if all tools were parallelized, we would still face sequential bottlenecks and repeated file I/O
23 Pipeline execution with elprep
- elprep composes all steps and executes them one by one in parallel
- Parallel threads do not have to meet in between steps, and repeated file I/O is completely avoided
(see also: Amdahl's Law, https://en.wikipedia.org/wiki/amdahl%27s_law)
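Amdahl's Law, referenced on this slide, bounds the speed-up by the fraction of work that stays sequential. A small Python helper makes the effect concrete. The symbols `p` (parallelizable fraction) and `n` (number of threads) and the example fractions are chosen here for illustration; they are not measurements from elprep.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speed-up predicted by Amdahl's Law: 1 / ((1 - p) + p / n).

    p is the fraction of the run time that can be parallelized (0..1),
    n is the number of threads.
    """
    return 1.0 / ((1.0 - p) + p / n)

# Hypothetical fractions: shrinking the sequential part (synchronization
# points between steps, file I/O) raises the achievable speed-up at 40 threads.
for fraction in (0.50, 0.90, 0.99):
    print(f"p={fraction:.2f}: {amdahl_speedup(fraction, 40):.1f}x")
```

This is why removing the points where threads have to meet between steps matters as much as parallelizing each step on its own.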
24 Command line tool
elprep, an open source tool
- Filtering unmapped reads (strict / non-strict)
- Replacing reference sequences
- Replacing read groups
- Marking / removing duplicates
- Cleaning SAM
- Sorting by coordinate order / queryname
Documentation for command line and API: https://github.com/exascience/elprep
25 Benchmarks
elprep vs. SAMtools/Picard:
- single-threaded: 4x faster
- multi-threaded:
  - 25x faster at 40 threads (4x10-core Intel Xeon E7-4870, 512GB RAM)
  - 10.5x faster at 32 threads (2x8-core Intel Xeon E with hyper-threading, 256GB RAM)
- tested with whole-genome data NA12878
- validated 100% match with SAMtools/Picard output
- checked with samtools and picard-tools
26 Benchmarks
- elprep reduces preparation from 20 hours to 2 hours
- On a 2x8-core Intel Xeon E with 256GB RAM
- elprep merges execution of the different preparation steps

NA12878                    SAMtools + Picard   elprep
Removing unmapped reads    38m12s              -
Sort contigs               4h55m55s            -
Sorting                    1h46m45s            -
Mark duplicates            6h59m44s            -
Read groups                4h58m53s            -
Total                      19h19m29s           1h54m51s

[Figure: bar chart of run time per step (removing unmapped reads, reordering contigs, sorting, marking duplicates, adding read groups, combined) for samtools/picard vs. elprep on NA12878; time axis from 0 to 20h]
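The totals in the table can be checked with a few lines of Python. The timings are copied from the slide; the helper `to_seconds` is written for this check only.

```python
import re

def to_seconds(text: str) -> int:
    """Convert a duration such as '4h55m55s' or '38m12s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds

# Per-step timings for SAMtools + Picard, as listed in the table.
steps = {
    "Removing unmapped reads": "38m12s",
    "Sort contigs": "4h55m55s",
    "Sorting": "1h46m45s",
    "Mark duplicates": "6h59m44s",
    "Read groups": "4h58m53s",
}
baseline_total = sum(to_seconds(t) for t in steps.values())
elprep_total = to_seconds("1h54m51s")
speedup = baseline_total / elprep_total
print(baseline_total == to_seconds("19h19m29s"))  # the step times add up to the total
print(f"{speedup:.1f}x")                          # about 10.1x
```

The five step times sum exactly to the reported 19h19m29s, and the ratio to elprep's 1h54m51s is roughly 10x, in line with the "20 hours to 2 hours" summary.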
27 Advantages of elprep
- Efficient multi-threaded execution
- Operates entirely in memory, no intermediate files
- An order of magnitude faster than standard tools
- 100% equivalent output to SAMtools and Picard
- Compatible with existing tools
- Compatible with compressed SAM (BAM/CRAM)
- Modular software architecture, easy to extend
- Open source
28 Notices and Disclaimers
Copyright 2014 Imec and Intel Corporation. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://
More informationThe SAM Format Specification (v1.3 draft)
The SAM Format Specification (v1.3 draft) The SAM Format Specification Working Group July 15, 2010 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationAccelrys Pipeline Pilot and HP ProLiant servers
Accelrys Pipeline Pilot and HP ProLiant servers A performance overview Technical white paper Table of contents Introduction... 2 Accelrys Pipeline Pilot benchmarks on HP ProLiant servers... 2 NGS Collection
More informationHow to map millions of short DNA reads produced by Next-Gen Sequencing instruments onto a reference genome
How to map millions of short DNA reads produced by Next-Gen Sequencing instruments onto a reference genome Stratos Efstathiadis stratos@nyu.edu Slides are from Cole Trapneli, Steven Salzberg, Ben Langmead,
More informationH.J. Lu, Sunil K Pandey. Intel. November, 2018
H.J. Lu, Sunil K Pandey Intel November, 2018 Issues with Run-time Library on IA Memory, string and math functions in today s glibc are optimized for today s Intel processors: AVX/AVX2/AVX512 FMA It takes
More informationSequence Alignment: Mo1va1on and Algorithms
Sequence Alignment: Mo1va1on and Algorithms Mo1va1on and Introduc1on Importance of Sequence Alignment For DNA, RNA and amino acid sequences, high sequence similarity usually implies significant func1onal
More informationMapping NGS reads for genomics studies
Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization
More informationIntel Cluster Checker 3.0 webinar
Intel Cluster Checker 3.0 webinar June 3, 2015 Christopher Heller Technical Consulting Engineer Q2, 2015 1 Introduction Intel Cluster Checker 3.0 is a systems tool for Linux high performance compute clusters
More informationBuilt to Scale: The Intel Xeon Processor E7 and E5 Families in Cisco UCS
Built to Scale: The Intel Xeon Processor E7 and E5 Families in Cisco UCS Abdelaziz BENETTAIB Market Development Manager Intel Corporation The Heart of a Flexible, Efficient Data Center More Devices More
More informationPre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory
Pre-processing and quality control of sequence data Barbera van Schaik KEBB - Bioinformatics Laboratory b.d.vanschaik@amc.uva.nl Topic: quality control and prepare data for the interesting stuf Keep Throw
More informationOpenMP * 4 Support in Clang * / LLVM * Andrey Bokhanko, Intel
OpenMP * 4 Support in Clang * / LLVM * Andrey Bokhanko, Intel Clang * : An Excellent C++ Compiler LLVM * : Collection of modular and reusable compiler and toolchain technologies Created by Chris Lattner
More informationContributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth
Presenter: Surabhi Jain Contributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth May 25, 2018 ROME workshop (in conjunction with IPDPS 2018), Vancouver,
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationChangpeng Liu, Cloud Software Engineer. Piotr Pelpliński, Cloud Software Engineer
Changpeng Liu, Cloud Software Engineer Piotr Pelpliński, Cloud Software Engineer Introduction to VirtIO and Vhost SPDK Vhost Architecture Use cases for vhost Benchmarks Next steps QEMU VIRTIO Vhost (KERNEL)
More informationAssembly of the Ariolimax dolicophallus genome with Discovar de novo. Chris Eisenhart, Robert Calef, Natasha Dudek, Gepoliano Chaves
Assembly of the Ariolimax dolicophallus genome with Discovar de novo Chris Eisenhart, Robert Calef, Natasha Dudek, Gepoliano Chaves Overview -Introduction -Pair correction and filling -Assembly theory
More informationWITH INTEL TECHNOLOGIES
WITH INTEL TECHNOLOGIES Commitment Is to Enable The Best Democratize technologies Advance solutions Unleash innovations Intel Xeon Scalable Processor Family Delivers Ideal Enterprise Solutions NEW Intel
More informationSoftware Optimization Case Study. Yu-Ping Zhao
Software Optimization Case Study Yu-Ping Zhao Yuping.zhao@intel.com Agenda RELION Background RELION ITAC and VTUE Analyze RELION Auto-Refine Workload Optimization RELION 2D Classification Workload Optimization
More informationRNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF
RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF Acknowledgments Genomics Virtual Lab: gvl.org.au Galaxy for tutorials: galaxy-tut.genome.edu.au Galaxy Australia: galaxy-aust.genome.edu.au
More informationEnsembl RNASeq Practical. Overview
Ensembl RNASeq Practical The aim of this practical session is to use BWA to align 2 lanes of Zebrafish paired end Illumina RNASeq reads to chromosome 12 of the zebrafish ZV9 assembly. We have restricted
More informationTHE STORAGE PERFORMANCE DEVELOPMENT KIT AND NVME-OF
14th ANNUAL WORKSHOP 2018 THE STORAGE PERFORMANCE DEVELOPMENT KIT AND NVME-OF Paul Luse Intel Corporation Apr 2018 AGENDA Storage Performance Development Kit What is SPDK? The SPDK Community Why are so
More information