Howdah: a flexible pipeline framework and applications to analyzing genomic data. Steven Lewis, PhD

Howdah: a flexible pipeline framework and applications to analyzing genomic data. Steven Lewis, PhD. slewis@systemsbiology.org

What is a Howdah? A howdah is a carrier for an elephant. The idea is that multiple tasks can be performed during a single MapReduce pass.

Why Howdah? Many of the jobs we perform in biology are structured. The structure of the data is well known. The operations are well known. The structure of the output is well known, and simple concatenation will not work. We need to perform multiple operations with multiple output files on a single data set.

General Assumptions The data set being processed is large, and a Hadoop MapReduce step is relatively expensive. The final output set is much smaller than the input data and is not expensive to process. Further steps in the processing may not be handled by the cluster. Output files require specific structure and formatting.

Why Not Cascading or Pig? Much of the code in biological processing is custom. Special formats are common. There are frequent exits to external code such as Python. Output must be formatted and usually lives outside of HDFS.

Job -> Multiple Howdah Tasks Howdah tasks pick up data during a set of MapReduce jobs. Tasks own their data by prepending markers to the keys. Tasks may spawn multiple sub-tasks. Tasks (and sub-tasks) may manage their ultimate output. Howdah tasks exist at every phase of a job, including pre and post launch.
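
A minimal sketch of that key-tagging idea in Java; the TaskKeys class and its separator character are illustrative assumptions, not Howdah's actual API.

```java
import org.apache.hadoop.io.Text;

// Illustrative helper: a task "owns" the records it emits by prepending its own
// marker to every key, and strips the marker when the record comes back to it.
public final class TaskKeys {
    private static final char SEPARATOR = '#';   // assumed marker separator

    /** Prepend the owning task's id so the record can be routed back to it later. */
    public static Text tag(String taskId, String key) {
        return new Text(taskId + SEPARATOR + key);
    }

    /** Which task owns this key? */
    public static String taskId(Text taggedKey) {
        String s = taggedKey.toString();
        return s.substring(0, s.indexOf(SEPARATOR));
    }

    /** The key as the task originally emitted it. */
    public static String originalKey(Text taggedKey) {
        String s = taggedKey.toString();
        return s.substring(s.indexOf(SEPARATOR) + 1);
    }
}
```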

Diagram: Setup -> Map1 (Howdah Tasks: Partition Task, Break SubTask, Statistics SubTask, SNP SubTask) -> Reduce1 (SNP Task) -> Consolidation -> Outputs

Task Life Cycle Tasks are created by reading an array of Strings in the job config. The strings create instances of Java classes and set parameters. All tasks are created before the main job is run to allow each task to add configuration data. Tasks are created in all steps of the job but are often inactive.
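
A rough sketch of that life cycle, assuming a hypothetical configuration property ("howdah.tasks") holding class names and a minimal task interface; the real framework's names and parameter handling may differ.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

public final class TaskFactory {

    /** Minimal task contract assumed for this sketch. */
    public interface HowdahTask {
        void configure(Configuration conf);   // lets each task add its own configuration
    }

    /** Create every task named in the job config before the main job is run. */
    public static List<HowdahTask> createTasks(Configuration conf) {
        List<HowdahTask> tasks = new ArrayList<>();
        String[] taskClassNames = conf.getStrings("howdah.tasks");   // assumed property name
        if (taskClassNames == null) {
            return tasks;
        }
        for (String className : taskClassNames) {
            try {
                Class<?> cls = Class.forName(className.trim());
                HowdahTask task = (HowdahTask) cls.getDeclaredConstructor().newInstance();
                task.configure(conf);   // task may add configuration data before launch
                tasks.add(task);
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot create task " + className, e);
            }
        }
        return tasks;
    }
}
```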

Howdah Phases Job Setup, before the job starts, sets up input files, configuration, and the distributed cache. Processing then proceeds through the Initial Map (data incoming from files), Map(n) subsequent maps (data assigned to a task), Reduce(n) (data assigned to a task), and Consolidation (data assigned to a path).

Critical Concepts Looking at the kinds of problems we were solving, we found several common themes: multiple action streams in a single process, broadcast and early computation, and consolidation.

Broadcast The basic problem: consider the case where all reducers need access to a number of global totals. Sample: a Word Count program wants to output not only the count but the fraction of all words of a specific length represented by this word. Thus the word "a" might be 67% of all words of length 1. Real example: a system is interested in the probability that a reading will be seen. Once millions of readings have been observed, the probability is the fraction of readings whose values are >= the test reading.

Maintaining Statistics For all such cases the processing needs access to global data and totals. Consider the problem of counting the number of words of a specific length. It is trivial for every mapper to keep count of the number of words of a particular length observed. It is also trivial to send this data as a key/value pair in the cleanup operation.

Getting Statistics to Reducers For statistics to be used in reduction, two conditions need to be met: 1. Every reducer must receive statistics from every mapper. 2. All statistics must be received and processed before data requiring the statistics is handled.

Broadcast Every mapper sends its totals to each reducer; each reducer builds a grand total before other keys are sent. Diagram: Mapper1, Mapper2, Mapper3, Mapper4 each send Totals to Reducer1 through Reducer5, and each reducer forms a Grand Total.
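
A minimal sketch of this broadcast pattern using Hadoop's Java API. Two details are assumptions, not taken from the talk: broadcast keys start with a prefix that sorts before ordinary word keys, and the target reducer index is encoded in the key so a custom partitioner can deliver one copy of each mapper's totals to every reducer.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

public class BroadcastWordLengthMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    // "!" sorts before letters and digits, so these keys reach reducers first.
    static final String BROADCAST_PREFIX = "!TOTAL#";

    private final Map<Integer, Long> countsByLength = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            countsByLength.merge(word.length(), 1L, Long::sum);   // per-mapper totals
            context.write(new Text(word), new LongWritable(1));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Send this mapper's totals to every reducer; the partition index is
        // encoded in the key so the partitioner below can route each copy.
        int reducers = context.getNumReduceTasks();
        for (int partition = 0; partition < reducers; partition++) {
            for (Map.Entry<Integer, Long> e : countsByLength.entrySet()) {
                Text key = new Text(BROADCAST_PREFIX + partition + "#len" + e.getKey());
                context.write(key, new LongWritable(e.getValue()));
            }
        }
    }

    /** Routes broadcast keys to the reducer named in the key, everything else by hash. */
    public static class BroadcastPartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            String s = key.toString();
            if (s.startsWith(BROADCAST_PREFIX)) {
                String rest = s.substring(BROADCAST_PREFIX.length());
                return Integer.parseInt(rest.substring(0, rest.indexOf('#')));
            }
            return (s.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
}
```

Each reducer can then sum the broadcast values into a grand total from its first keys and use it once the ordinary word keys arrive.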

Consolidators Many, perhaps most, MapReduce jobs take a very large data set and generate a much smaller set of output. While the large set is being reduced, it makes sense to have each reducer write to a separate file (part-r-00000, part-r-00001, ...) independently and in parallel. Once the smaller output set is generated, it makes sense for a single reducer to gather all input and write a single output file in a format of use to the user. These tasks are called consolidators.

Consolidators Consolidation is the last step. Consolidators can generate any output in any location and frequently write off the HDFS cluster. A single Howdah job can generate multiple consolidated files; all output to a given file is handled by a single reducer.

Consolidation Process The consolidation mapper assigns data to an output file; the key is the original key prepended with the file path. The consolidation reducer receives all data for an output file and writes that file using the path. The format of the output is controlled by the consolidator. The write is terminated by cleanup or by receiving data for a different file.
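
A rough sketch of that process, assuming the key carries the destination path before a separator character. The partitioner hashes only the path, so each output file is owned by exactly one reducer, which then streams values into that file (possibly a local path off HDFS). Names and the separator are illustrative.

```java
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class ConsolidationSketch {

    static final char PATH_SEPARATOR = '|';   // everything before this is the output path

    /** All keys that share an output path go to the same reducer. */
    public static class PathPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String s = key.toString();
            String path = s.substring(0, s.indexOf(PATH_SEPARATOR));
            return (path.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    /** Writes one formatted file per output path; a new path closes the previous file. */
    public static class ConsolidationReducer extends Reducer<Text, Text, Text, Text> {
        private String currentPath;
        private PrintWriter writer;

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String s = key.toString();
            String path = s.substring(0, s.indexOf(PATH_SEPARATOR));
            if (!path.equals(currentPath)) {
                closeCurrent();                            // previous file is complete
                currentPath = path;
                writer = new PrintWriter(path, "UTF-8");   // may be a local, off-HDFS path
            }
            for (Text value : values) {
                writer.println(value.toString());          // consolidator-specific formatting
            }
        }

        @Override
        protected void cleanup(Context context) {
            closeCurrent();   // the write also terminates at cleanup
        }

        private void closeCurrent() {
            if (writer != null) {
                writer.close();
                writer = null;
            }
        }
    }
}
```

Because keys reach the reducer in sorted order, all keys for one path are contiguous, so closing on a path change is safe.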

Biological Problems Shotgun DNA Sequencing DNA is cut into short segments. The ends of each segment are sequenced. Each end of a read is fit to the reference sequence.
reference: ACGTATTACGTACTACTACATAGATGTACAGTACTACAATAGATTCAAACATGATACA
sequences with ends fit to the reference:
ATTACGTACTAC......ACAGTACTACAA
CGTATTACGTAC.ACTACAATAGATT
ACTACTACATA...CAATAGATTCAAA

Processing Group all reads mapping to a specific region of the genome. Find all reads overlapping a specific location on the genome. A location is chromosome:position, e.g. CHRXIV:3426754.
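
One way the grouping step could look as a mapper, assuming a tab-separated input of chromosome, start position, and sequence, and an arbitrary 1 kb region size; this is an illustration, not Howdah's actual input handling.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReadLocationMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final long REGION_SIZE = 1000L;   // assumed 1 kb grouping regions

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: chromosome <TAB> start <TAB> sequence
        String[] fields = line.toString().split("\t");
        if (fields.length < 3) {
            return;   // skip malformed records
        }
        String chromosome = fields[0];
        long start = Long.parseLong(fields[1]);
        String sequence = fields[2];

        long firstRegion = start / REGION_SIZE;
        long lastRegion = (start + sequence.length() - 1) / REGION_SIZE;
        for (long region = firstRegion; region <= lastRegion; region++) {
            // Key by "chromosome:regionStart" (in the spirit of CHRXIV:3426754) so every
            // read overlapping the region is seen together by one reducer.
            Text regionKey = new Text(chromosome + ":" + (region * REGION_SIZE));
            context.write(regionKey, new Text(start + "\t" + sequence));
        }
    }
}
```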

Single Mutation Detection Most sequences agree in every position with the reference sequence. When many sequences disagree with the reference in one position but agree with each other, a single mutation (SNP) is suspected.
reference: ACGTATTACGTACTACTACATAGATGTACAG
aligned reads:
ACGTATTACGTAC
TTTTACGTACTACTA
GTTTTACGTACTAC
TTTTACGTACTACATAG
CGTATTACGTACTACTA
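
A toy version of that test: count the bases the overlapping reads report at one position, and flag the position when the dominant base differs from the reference yet most reads agree on it. The 0.8 agreement threshold is an arbitrary assumption, not a value from the talk.

```java
import java.util.HashMap;
import java.util.Map;

public final class SnpCall {

    /** True when most reads agree on a base that differs from the reference base. */
    public static boolean isLikelySnp(char referenceBase, char[] observedBases) {
        if (observedBases.length == 0) {
            return false;
        }
        Map<Character, Integer> counts = new HashMap<>();
        for (char base : observedBases) {
            counts.merge(Character.toUpperCase(base), 1, Integer::sum);
        }
        char dominantBase = Character.toUpperCase(referenceBase);
        int dominantCount = 0;
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            if (e.getValue() > dominantCount) {
                dominantBase = e.getKey();
                dominantCount = e.getValue();
            }
        }
        double agreement = (double) dominantCount / observedBases.length;
        // SNP suspected: reads disagree with the reference but agree with each other.
        return dominantBase != Character.toUpperCase(referenceBase) && agreement >= 0.8;
    }
}
```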

Deletion Detection Deletions are detected when the ends of a read are fitted to regions of the reference much further apart than normal. The fit is the true length plus the deleted region.
reference: ACGTATTACGTACTACTACATAGATGTACAGTACTACAATAGATTCAAACATGATACAACACACAGTA
actual sequence with deletion: ACGTATTACGTACTAC TCAAACATGATACAACACACAGTAAGATAGTTACACGTTTATATATATACC
fit to actual: ATTACGTACTAC...???..TACAACACACAG
reported fit to reference: ATTACGTACTAC......TACAACACACAG
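
The arithmetic behind the deletion call, as a small helper: the span the read pair covers on the reference, minus the expected fragment length, estimates the deleted region. The method name and the use of a single fixed expected length are illustrative assumptions.

```java
public final class DeletionEstimate {

    /**
     * @param leftEndStart   reference position where the left end begins its fit
     * @param rightEndEnd    reference position where the right end ends its fit
     * @param expectedLength expected fragment length from the sequencing library
     * @return estimated deletion size, or 0 if the span is not longer than expected
     */
    public static long estimateDeletion(long leftEndStart, long rightEndEnd, long expectedLength) {
        long observedSpan = rightEndEnd - leftEndStart;   // fit on the reference
        long excess = observedSpan - expectedLength;      // fit = true length + deletion
        return Math.max(0L, excess);
    }
}
```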

Performance Platforms Local: 4-core Core2 Quad, 4 GB. Cluster: 10-node cluster running Hadoop, 8 cores per node, 24 GB RAM, 4 TB disk. AWS: small-node cluster (number of nodes as specified), 1 GB virtual.

Platform          Data      Time
Local, 1 CPU      200 MB    15 min
Local, 1 CPU      2 GB      64 min
10-node cluster   2 GB      1.5 min
10-node cluster   1 TB      32 min
AWS, 3 small      2 GB      7 min
AWS, 40 small     100 GB    800 min

Conclusion Howdah is useful for tasks where: a large amount of data is processed into a much smaller output set; multiple analyses and outputs are desired for the same data set; the format of the output file is defined and cannot simply be a concatenation; complex processing of input data is required, sometimes including broadcast of global information.

Questions

Critical Elements Keys are enhanced by prepending a task-specific ID. Broadcast is handled by prepending an ID that sorts before non-broadcast IDs. Consolidation is handled by prepending a file path and using a partitioner that ensures all data for one file is sent to the same reducer.