Bioinformatics Framework

Similar documents
Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Super-Fast Genome BWA-Bam-Sort on GLAD

Running SNAP. The SNAP Team October 2012

Lecture 12. Short read aligners

NA12878 Platinum Genome GENALICE MAP Analysis Report

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline

Scalable RNA Sequencing on Clusters of Multicore Processors

Sentieon Documentation

REPORT. NA12878 Platinum Genome. GENALICE MAP Analysis Report. Bas Tolhuis, PhD GENALICE B.V.

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Running SNAP. The SNAP Team February 2012

elprep: a high- performance tool for preparing SAM/BAM files for variant calling Charlo<e Herzeel (Imec) Pascal Costanza (Intel) July 2014

Variant calling using SAMtools

ChIP-seq (NGS) Data Formats

CORE Year 1 Whole Genome Sequencing Final Data Format Requirements

Falcon Accelerated Genomics Data Analysis Solutions. User Guide

Mapping NGS reads for genomics studies

ZFS for NGS data analysis

ELPREP PERFORMANCE ACROSS PROGRAMMING LANGUAGES PASCAL COSTANZA CHARLOTTE HERZEEL FOSDEM, BRUSSELS, BELGIUM, FEBRUARY 3, 2018

Sequence Mapping and Assembly

Copy Number Variations Detection - TD. Using Sequenza under Galaxy

Indexing in RAMCloud. Ankita Kejriwal, Ashish Gupta, Arjun Gopalan, John Ousterhout. Stanford University

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

Ensembl RNASeq Practical. Overview

Practical Near-Data Processing for In-Memory Analytics Frameworks

Reads Alignment and Variant Calling

Aeromancer: A Workflow Manager for Large- Scale MapReduce-Based Scientific Workflows

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

Using MPI One-sided Communication to Accelerate Bioinformatics Applications

Halvade: scalable sequence analysis with MapReduce

WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010

PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR

NGS Data Visualization and Exploration Using IGV

Practical Linux Examples

Helpful Galaxy screencasts are available at:

SAMtools. SAM BAM. mapping. BAM sort & indexing (ex: IGV) SNP call

Heckaton. SQL Server's Memory Optimized OLTP Engine

INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA

Galaxy Platform For NGS Data Analyses

Calling variants in diploid or multiploid genomes

Analyzing massive genomics datasets using Databricks Frank Austin Nothaft,

Tutorial: Using BWA aligner to identify low-coverage genomes in metagenome sample Umer Zeeshan Ijaz

Accelerate Applications Using EqualLogic Arrays with directcache

Decrypting your genome data privately in the cloud

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1

Variation among genomes

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits

SRCMap: Energy Proportional Storage using Dynamic Consolidation

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

Introduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

NGS Data Analysis. Roberto Preste

Genome 373: Mapping Short Sequence Reads III. Doug Fowler

The Fusion Distributed File System

Under the Hood of Alignment Algorithms for NGS Researchers

Comparing Memory Systems for Chip Multiprocessors

Galaxy workshop at the Winter School Igor Makunin

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

In-Memory Data Management for Enterprise Applications. BigSys 2014, Stuttgart, September 2014 Johannes Wust Hasso Plattner Institute (now with SAP)

Bolt: I Know What You Did Last Summer In the Cloud

PageForge: A Near-Memory Content- Aware Page-Merging Architecture

The Logic of Physical Garbage Collection in Deduplicating Storage

Parallel In-Situ Data Processing with Speculative Loading

ls /data/atrnaseq/ egrep "(fastq fasta fq fa)\.gz" ls /data/atrnaseq/ egrep "(cn ts)[1-3]ln[^3a-za-z]\."

Exome sequencing. Jong Kyoung Kim

ASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

An Introduction to Linux and Bowtie

Kelly et al. Genome Biology (2015) 16:6 DOI /s x. * Correspondence:

A Fast Read Alignment Method based on Seed-and-Vote For Next GenerationSequencing

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

DNA Sequencing analysis on Artemis

Practical Bioinformatics for Life Scientists. Week 4, Lecture 8. István Albert Bioinformatics Consulting Center Penn State

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)

Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems

Batch Processing Basic architecture

EMC Backup and Recovery for Microsoft SQL Server

Linux multi-core scalability

The Journey of an I/O request through the Block Layer

INTRODUCTION AUX FORMATS DE FICHIERS

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Opendedupe & Veritas NetBackup ARCHITECTURE OVERVIEW AND USE CASES

Highly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture

Easy visualization of the read coverage using the CoverageView package

Programming Systems for Big Data

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data

ChunkStash: Speeding Up Storage Deduplication using Flash Memory

Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study

Dept. Of Computer Science, Colorado State University

Sep. Guide. Edico Genome Corp North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

Kaisen Lin and Michael Conley

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Transcription:

Persona: A High-Performance Bioinformatics Framework Stuart Byma 1, Sam Whitlock 1, Laura Flueratoru 2, Ethan Tseng 3, Christos Kozyrakis 4, Edouard Bugnion 1, James Larus 1 EPFL 1, U. Polytehnica of Bucharest 2, CMU 3, Stanford 4 12/07/2017 1

Agenda Motivation Bioinformatics Data and Tools Persona AGD Dataflow Engine Performance Results 2

Sequencing cost Not a wet lab problem anymore à IT / Systems problem 3

Implications ~300GB ~hours? Need efficient systems that scale well 4

Agenda Motivation Bioinformatics Data and Tools Persona AGD Dataflow Engine Performance Results ~300GB ~hours 5

What kind of data? Common sequencers produce Reads Snippets of DNA à AACCGCTAGCGCGCTAGCTCGAGCTAGAA 100-200 bases @sequence name, metadata ACGTTTCGATCGCGCCAGGAGGCTAG + -+*''))**55CCF@>>>>>CCCCCC times a few hundred million 6

Alignment Reference Genome Mismatch...TGACCTATAGCGATATAGCTTATTATTGGG-CAAAAATGGAATCGATTGATCG... Read: TATTATTGGGATAAAA-TGG Insertion Deletion times a few hundred million ~hours 7

Aligned Reads Stored in SAM/BAM read_name 16 chr12 85500011 70 18M * 0 0 TTTTACACACATTATCTC CDDFAEEC>EDDFFBCDEED?FCC@ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm Followed by Duplicate marking Sorting Recalibrations, analysis (variant calling) ~hours 8

Data and Tool Issues FASTQ SAM/BAM BED VCF 9

Persona Bioinformatics, Unified Aggregate Genomic Data 10

Agenda Motivation Bioinformatics Data and Tools Persona AGD Dataflow Engine Performance Results 11

Aggregate Genomic Data Bases Q-Scores Metadata Manifest Header Storage Subsystem Index Data compressed 12

Agenda Motivation Bioinformatics Data and Tools Persona AGD Dataflow Engine Performance Results 13

Dataflow AGD Chunks Dataflow execution framework Base on TensorFlow engine But no machine learning Operators perform computation on AGD chunks 14

Dataflow AGD Chunks Modularity Balance/tuning (bounded) Queueing 15

Fine-grained Threading AGD chunks optimized for storage Too coarse for some tasks Split into subchunks Delegate to executor shared resource Task queue + thread pool Aligners Notify ThreadPool Buf AGD 16

Aligner Graph AGD Chunks Get Chunk Alignment Executor Decompress/Parse Align Reads Compress Put Chunk 17

Graph Construction c = persona.read_chunk(path) d = persona.decompress(c) o = persona.align(d) sess = tf.session() result = sess.run([o]) 18

Persona Shell Persona Shell local runtime dist runtime align sort import $ persona align local i hg19 data/my_agd.json $ persona sort local data/my_agd.json 19

Distributed Computation Client $ persona client bwa-align Queue Service Server 0 Server 1 Server 1 Server N Storage Subsystem 20

Current Features Import data from FASTQ/BAM/SRA, export to BAM Sequence alignment with BWA-MEM, SNAP Dataset sorting Duplicate marking Dataset statistics (samtools flagstat) Read coverage (depth) 21

Agenda Motivation Bioinformatics Data and Tools Persona AGD Dataflow Engine Performance Results 22

Evaluation -- Setup Focused on sequence alignment using SNAP Throughput in bases aligned per second Data 223 million 101 base reads (~16GB) AGD chunks of 100K records Hardware 32X Ubuntu 16.04, 2x12 Xeon E5-2680v3 @ 2.5GHz Data on 6-disk RAID0 and single spindle drive 7 server Ceph object store for distributed execution 23

Evaluation -- Questions What are the bandwidth-saving effects of AGD? What is the overhead of the Persona framework? How well do Persona and AGD scale? 24

Performance AGD SNAP Persona SNAP * single disk Significantly less I/O à more efficient use of HW, BW 25

Persona Overhead * RAID-0 Negligible overhead! 26

Scaling Full dataset aligned in ~17 seconds 27

Scaling Limits 28

Persona Scalable Bioinformatics Aggregate Genomic Data https://github.com/epfl-vlsc/persona 29

backup 30

Performance Sort and Dup. Mark Sort By metadata or aligned location 1.54x speedup over samtools 5.15x speedup over Picard Dataset stats 2x speedup Duplicate marking Same algorithm as samblaster 3.73x faster than samblaster Coverage (depth) 2x speedup 31

Profiling 32

Read/Write Single Disk 33

Alignment Example: SNAP Build hash index of reference To align a read: Hash a portion (seed) Lookup Evaluate each hit Edit distance computation Cores align reads in parallel TATTATTGGGATAAAATGGTTT Reference Genome Index (40GB)...TATTACTGGGCAAAAATGGTTTATG... edit distance TATTATTGGGATAAAATGGTTT 34

Shared Data Sometimes need to share data between ops E.g. multi-gb index of reference genome Use TF session resource manager [string, string] à refcount object Op can create objects, provide handle to other ops ProviderOp Resource Manager LookupOrCreate() Lookup() [c, n] ConsumerOp 35

Data Movement Tensors not amenable to bioinfo data Leverage TF shared resources Implement reusable buffers Stable memory use Avoid syscalls Pool BufferPoolOp [container, name] 36

Bioinformatics? Biology, computer science, math, statistics Started mid 90 s with Human Genome Project Broad field Genomics, proteomics, systems biology This talk: Whole Genome Sequence (WGS) analysis Reading the letters of your DNA (ATCG ) 37