Efficient Alignment of Next Generation Sequencing Data Using MapReduce on the Cloud


2012 Cairo International Biomedical Engineering Conference (CIBEC), Cairo, Egypt, December 20-21, 2012. Efficient Alignment of Next Generation Sequencing Data Using MapReduce on the Cloud. Rawan AlSaad and Qutaibah Malluhi, Department of Computer Science and Engineering, Qatar University, Doha, Qatar; Mohamed Abouelhoda, Cairo University, Giza, Egypt and Nile University, Giza, Egypt. Abstract This paper presents a methodology for running NGS read mapping tools in the cloud environment based on the MapReduce programming paradigm. As a demonstration, the recently developed and robust sequence alignment tool BFAST is used within our methodology to handle massive datasets. The results of our experiments show that transforming existing read mapping tools to run within the MapReduce framework dramatically reduces the total execution time and enables the user to utilize the resources provided by the cloud. Index Terms Cloud computing, MapReduce, bioinformatics, sequence alignment. I. INTRODUCTION Rapidly evolving Next Generation Sequencing (NGS) technologies produce data on an unparalleled scale. A central challenge in the analysis of these data is sequence alignment, whereby sequence reads must be mapped (aligned) to a reference genome. The data sizes produced by next-generation sequencing machines imply that parallelism is essential to process the DNA sequences in a timely fashion. The problem of mapping NGS reads to a reference genome is naturally data parallel, as the reads can be processed independently. This problem structure suggests the use of the MapReduce paradigm together with its open-source implementation Hadoop, which is well suited to such data-intensive jobs. 
The introduction of Hadoop implementations on cloud computing platforms added another interesting advantage: users can scale their infrastructure (in terms of the number and type of machines) according to their needs and can use the services on a pay-as-you-go basis. Currently, Amazon AWS directly supports the creation of Hadoop-based clusters at special prices, but users of private cloud platforms and other commercial ones, like Microsoft's, can readily install Hadoop and use it. A handful of Hadoop-based projects for read mapping in the cloud have been launched in this area: these include CloudBurst [1], Crossbow [2], CloudAligner [3], and the work of [4]. Yet most of these projects rebuild the alignment tools from scratch to fit the cloud paradigm, which is in general a very difficult problem to tackle. Also, it is not possible to replace the alignment (read mapping) module in these tools with more advanced ones without recoding the alignment tools. In this paper, we present a methodology for easy and efficient transformation of existing NGS mapping tools into the cloud environment based on the MapReduce programming model [5], [6], without the need to recode these tools. The essence of this methodology is to use the MapReduce paradigm to partition the sequence alignment problem into a large number of sub-problems which can run independently in parallel, with minimal inter-task communication. Critical to the function and performance of our methodology is a scheme that allows the cloudified NGS alignment tool to run as a black box within the MapReduce model, without the need to build new parallel algorithms or recode the tools from scratch. By separating specific sequence analysis calculations from common data management infrastructure, tools can benefit from ongoing improvements to the cloud computing paradigm, the MapReduce programming model, and DNA sequence alignment tools. 
In this context, the paper makes the following contributions:
- Develop a methodology with techniques and mechanisms to ease porting existing NGS alignment tools into the cloud environment.
- Implement the MapReduce model in a way that enables a parallelized run of the cloudified NGS alignment tools as a black box within the MapReduce framework, with little or no change to the original code of these tools.
- Demonstrate the utility of the proposed methodology by transforming two recently developed and commonly used NGS alignment tools, BFAST [7] and SHRiMP [8], to the MapReduce model on the cloud.
- Demonstrate the efficiency and scalability of the proposed methodology on large problem sizes.
The rest of the paper is organized as follows: Section 2 provides basic background on the read mapping problem, MapReduce, and Hadoop. Section 3 describes the approach used to conduct this study, demonstrated with the BFAST tool. Section 4 presents the experiments and results used to evaluate the efficiency and scalability of the proposed methodology, and Section 5 concludes the paper.

II. BACKGROUND Sequence Alignment Software Tools: NGS data is composed of short segments of sequenced DNA. Each segment is called a read, and its length varies according to the sequencing technology used, ranging from tens to hundreds of base pairs (characters). With recent NGS technologies, one genome sequencing project produces millions of reads, with dataset sizes in the range of billions of base pairs. The NGS read mapping problem is to align (map) a set of NGS reads to a reference genomic sequence. The goal of any software tool solving this problem is to find the location of each read in the reference sequence (if it exists), where the alignment of the read and the respective subsequence at this location has the minimum number of edit operations (replacements, insertions, and deletions) or an alignment score exceeding a certain threshold. A wide variety of alignment software tools have been developed over the past few years to solve the NGS read mapping problem. These include, among others, BFAST [7], SHRiMP [8], Bowtie [9], BWA [10], and SOAP2 [11]. In this research, we have selected the tool BFAST to demonstrate the proposed methodology. This tool is open-source software designed for aligning giga-scale short-read sets with comparable or better speed than existing methods, while maintaining higher sensitivity and accuracy for deletions/insertions. The underlying method of BFAST is based on creating flexible and efficient whole-genome indexes to rapidly map reads to candidate alignment locations, with arbitrarily many independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses the Smith-Waterman algorithm [12] with gaps to support the detection of regions of similarity and dissimilarity. More details about BFAST are given later in this paper. 
MapReduce: MapReduce is a parallel programming model over a computer cluster composed of a number of computing nodes for processing large datasets. The power of the MapReduce computational paradigm is that it can intelligently distribute computations across a cluster with hundreds or thousands of computers, each analyzing a portion of the dataset stored locally on the compute node. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This programming model allows the application developer to focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on it. The core concept of the MapReduce framework is that the input is split into logical chunks, and each chunk may be initially processed independently by a map task. During the map process, the input file(s) are split into many smaller pieces, and each piece is sent to a node in the cluster to be processed. The map process creates <key, value> pairs and passes them to a mapper task. The mapper task is a piece of code defined by the software developer based on the application logic, and specifies what should be done with the <key, value> pairs. The reduce task, also developed by the software developer, collects the solutions from each of the nodes in the cluster and combines them into one file based on the defined reduction criteria. The power of MapReduce is that the map and reduce functions are executed in parallel over potentially hundreds or thousands of processors with minimal effort by the application developer. 
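The <key, value> flow described above can be illustrated with a toy, single-process simulation. This is purely illustrative and not part of the paper's implementation; a real Hadoop job distributes these same steps across cluster nodes:

```python
from collections import defaultdict

def map_task(chunk):
    # Emit <key, value> pairs for one input chunk (here: word counting).
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Combine the values for one key into the final result.
    return key, sum(values)

chunks = ["ACGT ACGT TTAG", "TTAG ACGT"]   # toy stand-ins for input splits
pairs = [p for c in chunks for p in map_task(c)]
result = dict(reduce_task(k, v) for k, v in shuffle(pairs).items())
print(result)   # {'ACGT': 3, 'TTAG': 2}
```

In a real deployment the map tasks run on the nodes holding each split, and the shuffle is performed by the framework over the network.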
Hadoop: Hadoop is a popular open-source implementation of the MapReduce model written in Java for cross-platform portability [13]. A key component of Hadoop is the Hadoop Distributed File System (HDFS), which enables efficient management of data files and their sharing among the nodes of the computer cluster. Amazon provides the Elastic MapReduce (EMR) product, which is essentially a computer cluster with Hadoop installed on it. III. METHODOLOGY The methodology is designed to ease the cloudification of existing NGS alignment tools to run in the cloud environment using the MapReduce programming paradigm. Before describing the MapReduce solution, we briefly introduce the main steps of the BFAST tool, which are likely to be very similar to the steps used by most sequence alignment software tools in the field. The sequence alignment process performed by BFAST is divided into four main steps. The first step is to create the indexes for the reference genomes; the number and layout of these indexes are determined by the user's speed and accuracy requirements. The second step is to find Candidate Alignment Locations (CALs) for each read. The expected number of CALs returned is a function of the number of indexes and layouts chosen in the first step, as well as the number of offsets. The third step is to fully align each CAL for each read. The fourth and final step is to filter and prioritize the final alignments. The user specifies criteria to select the correct alignment for each read; the criteria can be based on many factors, including uniqueness, score, or others. MapReduce divides the computations into two separate steps: map and reduce. In the map step, the larger sequence alignment problem is divided into many smaller independent sub-problems, which are fed to the map function where the sequence alignment tools are plugged in. 
The output of the map level, which is the partial alignment results, is then passed to the reduce function to merge the partial results and produce the final output. Calculations like finding the CALs, extending the CALs using local alignment algorithms, and prioritizing the alignments naturally operate at the map level of MapReduce, as they perform calculations on each chunk of the short-reads files independently. The setup used in our methodology to achieve the parallelized execution of the sequential BFAST program in the MapReduce framework is that the execution of the entire sequence of BFAST alignment stages for a subset of the short reads is assigned to each mapper. The reduce stage in our case simply passes its input to its output with no changes. However, in other bioinformatics applications, the reducer
might be further utilized to perform more complicated analysis on the data. The parallelization approach consists of segmenting the short-reads input into subsets of the same size and running multiple instances of the BFAST black box on each subset. The reference genome input is replicated on all the allocated cloud nodes, so that it is available to all instances of the BFAST program as part of the execution environment. Figure 1 illustrates this parallelization approach. The input short-reads sequence S is divided into subsets of equal size, S = {S1, S2, ..., Sn}, each of which is passed to one of the worker nodes. On each worker node, the indexes of the reference genome R = {R1, R2, ..., Rm} are resident. For the human genome, for example, we have 22 files representing the 22 chromosomes, and we have created five indexes for each of the chromosome files. In our implementation, the read files are split into blocks of 64 MB, which is equal to the HDFS default block size. The algorithm listed in Fig. 2 lays out the mapper function used in our methodology. The reference and read files are assigned automatically to each Mapper function by the Hadoop system. The MappingTool is a function that invokes the mapping program (like BFAST) with the respective arguments, input sequences, and indexes of the reference genome(s). There are two main steps which are executed before and after the mapper function is invoked: the pre-processing and post-processing steps. The pre-processing step includes two main phases: 1) a setup phase for preparing the reference genome, and 2) loading the short reads into the HDFS. In the setup phase, the reference genome files are uploaded to all the cluster nodes, where each node constructs its copy of the index, if it has not already been constructed.

Fig. 1. Parallelization of BFAST: node i runs BFAST(Si vs R1), BFAST(Si vs R2), ..., BFAST(Si vs Rm) for each subset Si of the input reads.

Fig. 2. Mapper function:
  Mapper() {
    for each reference file Ri in R {
      for each sequence read subset Si in S {
        MappingTool(Si, Ri, ARG)
      }
    }
  }

In general, all read mapping programs include a step for creating an index of the reference genome to speed up the computation. (The tools do not use the same indexing data structure, but different ones.) It is also worth mentioning that the indexes are created only once and used thereafter for all subsequent processing. In the second phase, the short-reads files are uploaded into the cloud through a connection to the HDFS. The pre-processing phase is handled by the user according to our tool manual. The post-processing step is used to prepare the final output of the mapping process. Typical post-processing work includes removing duplicates, prioritizing the alignment scores, and selecting the best alignments. There are two options for the placement of the post-processing step: it can exist either as an integrated part of the MappingTool black box, or as a separate module, independent of the MappingTool black box, at either the map or reduce phase. In our implementation, the post-processing step exists as part of the BFAST black box. The criteria for selecting the correct alignment for each read are specified by the user and can be based on many factors such as uniqueness or scoring functions. The mapping output is collected after all the computations are completed and stored into a location on HDFS. These are the main characteristics of our proposed methodology: A. Pushing Environment to Data One of the core principles in the MapReduce model is to push code to data. Therefore, map functions move to nodes that hold the data on which the map will work. 
Our methodology calls for mechanisms to push not only the code, but the entire execution environment, close to the data. The execution environment includes the code, the MappingTool black box, the reference genome data, and the temporary directories created to handle the MappingTool side-effect files. This is an important procedure in the era of massive datasets and open-source software. This approach works very well when the execution environment is smaller than the data one wishes to analyze, but it still has some limitations and incurs some overhead when the execution environment grows to be larger than the data. In our case, the approach works very well because the size of the reads data is much larger than the size of the execution environment. B. Cloudification without Recoding By using MapReduce, Hadoop, and HDFS, we were able to capitalize on the technical advantages conferred by MapReduce/Hadoop without having to recode our own sequence analysis algorithms and workflows, and without having to design our own solutions for job queuing, tracking, and maintenance. Although the BFAST tool is not implemented in Java, and the MapReduce and Hadoop frameworks are primarily designed for Java, we were able to run the C code of BFAST within the MapReduce/Hadoop Java framework using the Java Native Interface (JNI).
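The black-box scheme above — each mapper handing its input split to an unmodified external aligner and an identity reducer passing the partial results through — can be sketched as follows. This is a simplified illustration in Python using a subprocess call rather than the JNI mechanism the paper actually uses, and the `-f`/`-r` flags are hypothetical, not BFAST's real command-line interface:

```python
import subprocess
import tempfile

def run_mapping_tool(reads_chunk, ref_index_dir, tool_cmd):
    # Write this mapper's input split to a local temporary file, then
    # invoke the external aligner as a black box and capture its output.
    with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as f:
        f.write(reads_chunk)
        reads_path = f.name
    cmd = [tool_cmd, "-f", ref_index_dir, "-r", reads_path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout  # partial alignments, handed to the reducer

def reduce_task(partial_results):
    # Identity reducer, as in the paper: pass the partial results through.
    return "".join(partial_results)

# Demo with `echo` standing in for the aligner binary.
out = run_mapping_tool("@r1\nACGT\n+\nIIII\n", "ref_idx", "echo")
print(reduce_task([out]), end="")
```

The key point is that the aligner binary needs no modification: the wrapper only stages its input locally and collects its standard output.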

C. Data Pipelining across Methodology Modules Pipelining the input, intermediate, and output data between the different modules was a challenging task due to the abstraction level used in the MapReduce and Hadoop frameworks. The data pipelining in our methodology is managed using named pipes, which allow different processes to communicate with each other, and temporary directories created on each of the cluster nodes to locally store the intermediate MappingTool side-effect files. Only the partial alignment results are promoted to the reducer; other intermediate MappingTool side-effect files are discarded. D. Multiple Sources for the Mapper Input In typical MapReduce implementations, the input of the mapper is expected to reside within an input directory on the HDFS. However, given the complex stream of operations needed by the sequence alignment tools, the standard storage and flow of the mapper input is modified in our methodology to allow for more flexibility. This is done by allowing two different sources for the mapper input: the first is the standard HDFS input directory, and the second is an input directory integrated as part of the execution environment, which stores the pre-processed reference genome files together with their indexes. IV. EXPERIMENTS AND RESULTS We conducted a number of experiments to demonstrate the efficiency and scalability of our methodology using BFAST (version 0.6.4f) and SHRiMP (version 2.2.2). We used sets of publicly available Illumina sequencing reads from the 1000 Genomes Project (accession: SRX6833, Illumina paired-end sequencing of an African male individual) and the human genome publicly available at UCSC [14]. The experiments were performed on a 16-node Hadoop cluster testbed using the resources provided by the IBM Blue Cloud 1.6 installed at Qatar University. Each node includes one 2.8 GHz Intel processor, 4 GB RAM, and 8 GB local disk space. 
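The named-pipe pipelining of Section C can be sketched as follows. This is a POSIX-only illustration with made-up pipe names and alignment records; in the real system the writer is the MappingTool process and the reader is the Hadoop mapper wrapper:

```python
import os
import tempfile
import threading

# Create a named pipe (FIFO) through which one local process streams
# intermediate alignment records to another, without touching HDFS.
fifo_path = os.path.join(tempfile.mkdtemp(), "partial_alignments.fifo")
os.mkfifo(fifo_path)

def producer():
    # Stands in for the MappingTool writing its partial results.
    with open(fifo_path, "w") as pipe:
        pipe.write("read1\tchr22\t1500\n")
        pipe.write("read2\tchr22\t8200\n")

t = threading.Thread(target=producer)
t.start()

# Stands in for the wrapper that forwards the results to the reducer.
with open(fifo_path) as pipe:
    records = pipe.read().splitlines()
t.join()
print(records)   # ['read1\tchr22\t1500', 'read2\tchr22\t8200']
```

Because a FIFO lives in the local filesystem but holds no data on disk, it avoids writing the large intermediate side-effect files twice.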
The compute nodes were running Hadoop. In all the experiments, the time to build the reference genome indexes and load them into HDFS is excluded, since this task is done once and the indexes can be reused in subsequent experiments. The first experiment explores how well the MapReduced BFAST scales as the number of reads increases. In this experiment, all 16 nodes were used to align subsets of the reads to human chromosome 22. Fig. 3 shows the runtime of these tasks; each runtime is averaged over 3 runs. The results show that the execution time of the MapReduced BFAST scales linearly as the number of reads increases, as expected, since the reads are processed independently. The second experiment evaluates how well the MapReduced BFAST scales as the number of nodes increases for a fixed problem size. Fig. 4 shows the execution time of the MapReduced BFAST for mapping 5 Mbp of reads on cluster sizes ranging between 2 and 16 nodes. Each runtime value is averaged over the readings of 3 runs. The results show that as the number of nodes doubles, the execution time drops by almost 50%. The graph also shows that the overhead for Hadoop to schedule tasks to all of the compute nodes, process the read files, and map the first reads to the compute nodes is constant regardless of the total number of available nodes. The third experiment compares the run time of the parallel execution of the MapReduced versions of BFAST and SHRiMP using our methodology with the serial execution of the original implementations of these two tools on the same computing environment, for each tool separately. The first part of this experiment compares the run time of the parallel execution of the MapReduced BFAST on 8 nodes with the serial execution of the original BFAST. The run time curves in Fig. 5 show the results of this test, in which both implementations of BFAST are executed for mapping 5-30 Mbp of reads to human chromosome 22 (each runtime value is averaged over 3 runs). 
The second part of this experiment compares the run time of the parallel execution of the MapReduced SHRiMP on 8 nodes with the serial execution of the original SHRiMP. The run time curves in Fig. 6 show the results of this test, in which the two implementations of SHRiMP are executed for mapping 2-32 Kbp of reads to human chromosome 22 (each runtime value is averaged over 3 runs). The results of these two experiments show that the MapReduce implementations of both tools, BFAST and SHRiMP, outperform the original implementations with the same speedup of about 7x. The performance gain of the MapReduce implementation of these tools comes mainly from the partitioning and parallel processing of the huge reference genome as well as the reads.

Fig. 3. BFAST running time vs. number of bases (Mbp).
Fig. 4. BFAST running time vs. number of nodes.
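The reported scaling behaviour — runtime roughly halving as the node count doubles, plus a constant Hadoop scheduling overhead — is consistent with a simple model in which the parallel work divides across nodes while the startup cost does not. The numbers below are illustrative, not the paper's measurements:

```python
def modeled_runtime(total_work_min, nodes, overhead_min):
    # Parallel work divides across nodes; the scheduling/startup
    # overhead stays constant regardless of cluster size.
    return overhead_min + total_work_min / nodes

work, overhead = 320.0, 5.0   # illustrative values only
times = {n: modeled_runtime(work, n, overhead) for n in (2, 4, 8, 16)}
print(times)   # {2: 165.0, 4: 85.0, 8: 45.0, 16: 25.0}
```

As the constant overhead grows relative to the per-node work, the observed speedup falls short of ideal, which is why doubling the nodes cuts runtime by "almost" rather than exactly 50%.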

Fig. 5. MapReduced BFAST vs. original BFAST: running time vs. number of bases (Mbp).
Fig. 6. MapReduced SHRiMP vs. original SHRiMP: running time vs. number of bases (Kbp).

V. CONCLUSION We presented an approach for running read mapping tools within the MapReduce framework to efficiently process large datasets in parallel on a computer cluster. Our methodology is generic and can be used with any read mapping tool without recoding it. Furthermore, our approach can be ported to any cloud environment that accommodates standard Linux distributions running Hadoop. Thus, we expect that by utilizing an approach similar to ours, more and more applications will take advantage of cloud environments in the near future. We have evaluated this by applying the same approach to a different sequence alignment tool, SHRiMP. The parallelization of SHRiMP using our approach was a very easy task, as expected. Moreover, the experimental results show that this approach yields good scalability and performance enhancement.

REFERENCES
[1] M. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, June 2009.
[2] H. Li and N. Homer, "A Survey of Sequence Alignment Algorithms for Next Generation Sequencing," Briefings in Bioinformatics, 2010.
[3] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping," BMC Res Notes, 2011.
[4] A. Bateman and M. Wood, "Cloud computing," Bioinformatics, 2009.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, 2004.
[6] J. Ekanayake, S. Pallickara, and G. Fox, "MapReduce for Data Intensive Scientific Analyses," in Fourth IEEE International Conference on eScience, 2008.
[7] N. Homer, B. Merriman, S. Nelson, and C. Creighton, "BFAST: An Alignment Tool for Large Scale Genome Resequencing," PLoS ONE, vol. 4, no. 11, 2009.
[8] S. Rumble et al., "SHRiMP: accurate mapping of short color-space reads," PLoS Computational Biology, May 2009.
[9] B. Langmead, C. Trapnell, M. Pop, et al., "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biol, 2009;10:R25.
[10] H. Li and R. Durbin, "Fast and accurate short read alignment with Burrows-Wheeler transform," Bioinformatics, 2009;25.
[11] R. Li, C. Yu, Y. Li, et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, 2009;25.
[12] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences," Journal of Molecular Biology, 1981.
[13] J. Venner, Pro Hadoop. Apress, 2009.
[14] Human Reference Genome, FASTA sequence of each chromosome, UCSC (February 2009). [Online].


More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d 4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2016) Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b,

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil Parallel Genome-Wide Analysis With Central And Graphic Processing Units Muhamad Fitra Kacamarga mkacamarga@binus.edu James W. Baurley baurley@binus.edu Bens Pardamean bpardamean@binus.edu Abstract The

More information

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop Myoungjin Kim 1, Seungho Han 1, Jongjin Jung 3, Hanku Lee 1,2,*, Okkyung Choi 2 1 Department of Internet and Multimedia Engineering,

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

Parallel data processing with MapReduce

Parallel data processing with MapReduce Parallel data processing with MapReduce Tomi Aarnio Helsinki University of Technology tomi.aarnio@hut.fi Abstract MapReduce is a parallel programming model and an associated implementation introduced by

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Research Article Cloud Computing for Protein-Ligand Binding Site Comparison

Research Article Cloud Computing for Protein-Ligand Binding Site Comparison Hindawi Publishing Corporation BioMed Research International Volume 213, Article ID 17356, 7 pages http://dx.doi.org/1.1155/213/17356 Research Article Cloud Computing for Protein-Ligand Binding Site Comparison

More information

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James

More information

Scalable RNA Sequencing on Clusters of Multicore Processors

Scalable RNA Sequencing on Clusters of Multicore Processors JOAQUÍN DOPAZO JOAQUÍN TARRAGA SERGIO BARRACHINA MARÍA ISABEL CASTILLO HÉCTOR MARTÍNEZ ENRIQUE S. QUINTANA ORTÍ IGNACIO MEDINA INTRODUCTION DNA Exon 0 Exon 1 Exon 2 Intron 0 Intron 1 Reads Sequencing RNA

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **

More information

Processing Technology of Massive Human Health Data Based on Hadoop

Processing Technology of Massive Human Health Data Based on Hadoop 6th International Conference on Machinery, Materials, Environment, Biotechnology and Computer (MMEBC 2016) Processing Technology of Massive Human Health Data Based on Hadoop Miao Liu1, a, Junsheng Yu1,

More information

Distributed Face Recognition Using Hadoop

Distributed Face Recognition Using Hadoop Distributed Face Recognition Using Hadoop A. Thorat, V. Malhotra, S. Narvekar and A. Joshi Dept. of Computer Engineering and IT College of Engineering, Pune {abhishekthorat02@gmail.com, vinayak.malhotra20@gmail.com,

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA

More information

High-performance short sequence alignment with GPU acceleration

High-performance short sequence alignment with GPU acceleration Distrib Parallel Databases (2012) 30:385 399 DOI 10.1007/s10619-012-7099-x High-performance short sequence alignment with GPU acceleration Mian Lu Yuwei Tan Ge Bai Qiong Luo Published online: 10 August

More information

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines A programming model in Cloud: MapReduce Programming model and implementation for processing and generating large data sets Users specify a map function to generate a set of intermediate key/value pairs

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Indexing Strategies of MapReduce for Information Retrieval in Big Data

Indexing Strategies of MapReduce for Information Retrieval in Big Data International Journal of Advances in Computer Science and Technology (IJACST), Vol.5, No.3, Pages : 01-06 (2016) Indexing Strategies of MapReduce for Information Retrieval in Big Data Mazen Farid, Rohaya

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

CLIENT DATA NODE NAME NODE

CLIENT DATA NODE NAME NODE Volume 6, Issue 12, December 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Efficiency

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES Al-Badarneh et al. Special Issue Volume 2 Issue 1, pp. 200-213 Date of Publication: 19 th December, 2016 DOI-https://dx.doi.org/10.20319/mijst.2016.s21.200213 INDEX-BASED JOIN IN MAPREDUCE USING HADOOP

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop Computational Architecture of Cloud Environments Michael Schatz April 1, 2010 NHGRI Cloud Computing Workshop Cloud Architecture Computation Input Output Nebulous question: Cloud computing = Utility computing

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Azure MapReduce. Thilina Gunarathne Salsa group, Indiana University

Azure MapReduce. Thilina Gunarathne Salsa group, Indiana University Azure MapReduce Thilina Gunarathne Salsa group, Indiana University Agenda Recap of Azure Cloud Services Recap of MapReduce Azure MapReduce Architecture Application development using AzureMR Pairwise distance

More information

Performance analysis of parallel de novo genome assembly in shared memory system

Performance analysis of parallel de novo genome assembly in shared memory system IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Performance analysis of parallel de novo genome assembly in shared memory system To cite this article: Syam Budi Iryanto et al 2018

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

DryadLINQ for Scientific Analyses

DryadLINQ for Scientific Analyses DryadLINQ for Scientific Analyses Jaliya Ekanayake 1,a, Atilla Soner Balkir c, Thilina Gunarathne a, Geoffrey Fox a,b, Christophe Poulain d, Nelson Araujo d, Roger Barga d a School of Informatics and Computing,

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Large Scale Computing Infrastructures

Large Scale Computing Infrastructures GC3: Grid Computing Competence Center Large Scale Computing Infrastructures Lecture 2: Cloud technologies Sergio Maffioletti GC3: Grid Computing Competence Center, University

More information

Nowadays data-intensive applications play a

Nowadays data-intensive applications play a Journal of Advances in Computer Engineering and Technology, 3(2) 2017 Data Replication-Based Scheduling in Cloud Computing Environment Bahareh Rahmati 1, Amir Masoud Rahmani 2 Received (2016-02-02) Accepted

More information

Evaluating Private Information Retrieval on the Cloud

Evaluating Private Information Retrieval on the Cloud Evaluating Private Information Retrieval on the Cloud Casey Devet University ofwaterloo cjdevet@cs.uwaterloo.ca Abstract The goal of Private Information Retrieval (PIR) is for a client to query a database

More information

Apache Spark and Hadoop Based Big Data Processing System for Clinical Research

Apache Spark and Hadoop Based Big Data Processing System for Clinical Research Apache Spark and Hadoop Based Big Data Processing System for Clinical Research Sreekanth Rallapalli 1,*, Gondkar R R 2 1 Research Scholar, R&D Centre, Bharathiyar University, Coimbatore, Tamilnadu, India.

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw D. E. Shaw Research Motivation

More information

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016 15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

A GPU Algorithm for Comparing Nucleotide Histograms

A GPU Algorithm for Comparing Nucleotide Histograms A GPU Algorithm for Comparing Nucleotide Histograms Adrienne Breland Harpreet Singh Omid Tutakhil Mike Needham Dickson Luong Grant Hennig Roger Hoang Torborn Loken Sergiu M. Dascalu Frederick C. Harris,

More information

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation 2016 IJSRSET Volume 2 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

More information

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters. Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Integrating GPU-Accelerated Sequence Alignment and SNP Detection for Genome Resequencing Analysis

Integrating GPU-Accelerated Sequence Alignment and SNP Detection for Genome Resequencing Analysis Integrating GPU-Accelerated Sequence Alignment and SNP Detection for Genome Resequencing Analysis Mian Lu, Yuwei Tan, Jiuxin Zhao, Ge Bai, and Qiong Luo Hong Kong University of Science and Technology {lumian,ytan,zhaojx,gbai,luo}@cse.ust.hk

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Google File System (GFS) and Hadoop Distributed File System (HDFS) Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

Accelerating Genomic Sequence Alignment Workload with Scalable Vector Architecture

Accelerating Genomic Sequence Alignment Workload with Scalable Vector Architecture Accelerating Genomic Sequence Alignment Workload with Scalable Vector Architecture Dong-hyeon Park, Jon Beaumont, Trevor Mudge University of Michigan, Ann Arbor Genomics Past Weeks ~$3 billion Human Genome

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

MAIN DIFFERENCES BETWEEN MAP/REDUCE AND COLLECT/REPORT PARADIGMS. Krassimira Ivanova

MAIN DIFFERENCES BETWEEN MAP/REDUCE AND COLLECT/REPORT PARADIGMS. Krassimira Ivanova International Journal "Information Technologies & Knowledge" Volume 9, Number 4, 2015 303 MAIN DIFFERENCES BETWEEN MAP/REDUCE AND COLLECT/REPORT PARADIGMS Krassimira Ivanova Abstract: This article presents

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Data Mining of Genomic Data using Classification Algorithms on MapReduce Framework: A Survey

Data Mining of Genomic Data using Classification Algorithms on MapReduce Framework: A Survey Data Mining of Genomic Data using Classification Algorithms on MapReduce Framework: A Survey Rajarshi Banerjee 1, Ravi Kumar Jha 1, Aditya Neel 1, Rituparna Samaddar (Sinha) 1 and Anindya Jyoti Pal 1 1

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information