High Performance and Cloud Computing (HPCC) for Bioinformatics


King Jordan, Georgia Tech, January 13, 2016. Adapted from BIOS-ICGEB HPCC for Bioinformatics.

Outline
- High performance computing (HPC)
- Cloud computing
- HPC vs. cloud computing
- Cloud computing for bioinformatics

HPC Overview: Client-server architecture

HPC Overview: Supercomputer clusters
A computer cluster is a single logical unit consisting of multiple computers linked through a local area network (LAN). The networked computers essentially act as a single, much more powerful machine. A computer cluster provides much faster processing speed, larger storage capacity, better data integrity, superior reliability, and wider availability of resources. Computer clusters are, however, much more costly to implement and maintain, resulting in much higher running overhead compared to a single computer. (This is where cloud computing comes in.)
http://www.techopedia.com/definition/6581/computer-cluster

HPC Overview: Parallel computing
Parallel computing is a type of computing architecture in which several processors execute an application or computation simultaneously. It performs large computations by dividing the workload among multiple processors, all of which work through the computation at the same time. Most supercomputers employ parallel computing principles to operate. Parallel computing is also known as parallel processing.
http://www.techopedia.com/definition/8777/parallel-computing
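As an illustrative sketch (not part of the slides), the divide-the-workload idea can be shown with Python's multiprocessing module: the data is split into chunks, each worker process sums its chunk simultaneously, and the partial results are combined at the end.

```python
from multiprocessing import Pool

def chunk_sum(chunk):
    # Each worker processes its share of the data independently.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Divide the workload into one chunk per worker.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        # All workers compute their partial sums at the same time.
        partial = pool.map(chunk_sum, chunks)
    # Combine the partial results into the final answer.
    return sum(partial)

if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))  # same result as sum(range(1000)): 499500
```

The same pattern (split, compute in parallel, combine) underlies MPI programs and the MapReduce model discussed later in these slides.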

HPC @ GA Tech: PACE (Partnership for an Advanced Computing Environment)
- 1,200 nodes with 30,000 CPU cores
- 90 terabytes of memory
- 2 petabytes of online commodity storage
- 215 terabytes of high-performance scratch storage

What is cloud computing? How is it related to HPC? How does it differ from traditional HPC?

What is cloud computing? (a skeptical view)
"The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do. I can't think of anything that isn't cloud computing with all of these announcements. The computer industry is the only industry that is more fashion-driven than women's fashion. Maybe I'm an idiot, but I have no idea what anyone is talking about. What is it? It's complete gibberish. It's insane. When is this idiocy going to stop?"
Larry Ellison, CEO of Oracle, OracleWorld 2008
https://www.youtube.com/watch?v=0facyai6dy0
Paul Hodor B A H

Moving towards a more specific definition of cloud computing
In 2011 the National Institute of Standards and Technology (NIST) issued Special Publication 800-145, "The NIST Definition of Cloud Computing." It is intended as a means for broad comparison of cloud services and deployment strategies, and as a baseline for discussion of what cloud computing is and how it is used. It defines the following categories of concepts:
- Essential characteristics
- Service models
- Deployment models

Essential characteristics of cloud computing (NIST)
- On-demand self-service
- Broad network access
- Resource pooling
- Rapid elasticity
- Measured service

Service models of cloud computing (NIST)
- Software as a Service (SaaS): the capability to use the provider's applications remotely over the network. The user does not manage the servers, operating system, storage, or even individual application capabilities.
- Platform as a Service (PaaS): the capability to deploy and run user-created or acquired applications on infrastructure made available by the provider. The user controls the deployed applications and their configuration, but does not manage servers, operating systems, or storage.
- Infrastructure as a Service (IaaS): the capability to provision computing, storage, and networking resources on which to deploy arbitrary software. The user has virtual control over all resources, but no control over the underlying physical infrastructure.

Deployment models of cloud computing (NIST)
- Private cloud
- Community cloud
- Public cloud
- Hybrid cloud

Cloud computing can also be considered a kind of commodity computing: the use of large numbers of readily available computing components for parallel computing, to get the greatest amount of useful computation at low cost. Computation is done on commodity computers rather than on high-cost supercomputers or boutique machines. Commodity computers are systems manufactured by multiple vendors, incorporating components based on open standards. Such systems are said to be built from commodity components, since standardization promotes lower costs and less differentiation among vendors' products.
http://en.wikipedia.org/wiki/commodity_computing

Cloud computing was made possible by the convergence of three existing technologies:
- The internet: research on packet networking funded in the 1960s; TCP/IP introduced in the 1980s; opened to commercial traffic 1990-1995
- Virtualization: early work by IBM in the 1960s; hardware virtualization became mainstream in the early 2000s
- Parallel computing: first multi-processor computers in the 1960s; birth of the Message Passing Interface (MPI) in 1992; MapReduce paper published in 2004

HPC versus cloud computing models
Traditional HPC model (physical data center):
- Buy a bunch of server boxes
- Add hard drives for storage
- Connect the servers with cables into an intranet
- Install an operating system and applications
- Log in remotely and start working: ssh user@mydomain.com
Cloud computing model (virtual data center):
- Provision a bunch of instances
- Attach virtual volumes for storage
- Create a virtual private cloud
- Launch a machine image
- Log in remotely and start working: ssh user@mydomain.com

Cloud computing: Available platforms
Lavanya Rishishwar, GATech

Cloud computing: Available platforms
- Amazon Web Services - http://aws.amazon.com/
- Microsoft Azure - http://azure.microsoft.com/en-us/
- Google App Engine - https://cloud.google.com/appengine/
- Illumina BaseSpace - https://basespace.illumina.com
- IBM Cloud Computing - http://www.ibm.com/cloud-computing/us/en/
- HP Eucalyptus - https://www.eucalyptus.com/
- HP Cloud - http://www.hpcloud.com/
- Rackspace Cloud - http://www.rackspace.com/cloud
- DigitalOcean - https://www.digitalocean.com/
- CenturyLink Cloud - https://www.centurylinkcloud.com/
- Verizon Cloud - http://cloud.verizon.com/
- Computer Sciences Corporation - http://www.csc.com/cloud
- Virtustream - http://www.virtustream.com/
- VMware - http://www.vmware.com/cloud-services/
- Fujitsu Cloud - http://www.fujitsu.com/global/solutions/cloud/
- Dimension Data Cloud - http://cloud.dimensiondata.com/am/en/
- GoGrid - http://www.gogrid.com/
- Joyent - https://www.joyent.com/

Cloud computing: Performance comparison
[Figure: Gartner Magic Quadrant of Cloud IaaS, 2014; axes: completeness of vision vs. ability to execute]

Cloud computing for bioinformatics
- Basics & need for cloud computing
- Barriers to use
- Widely used platforms: Amazon Web Services, Microsoft Azure, Bionimbus, Galaxy, Google, Illumina BaseSpace, ADAM

ADAM is a genomics analysis platform built in the Apache Spark ecosystem. It uses Spark's in-memory cluster computing to provide efficient, fault-tolerant, data-parallel distribution without the intermediate disk operations required by classical distributed approaches.

MapReduce Framework with Hadoop
http://hadoop.apache.org [More from Ahsan Huda]
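The MapReduce model itself is independent of Hadoop. As an illustrative sketch (the function names here are made up for the example, not Hadoop APIs), the classic word-count problem can be written in plain Python with the three phases made explicit: map emits key-value pairs, shuffle groups them by key, and reduce combines each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key into a final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["the cloud", "the cluster and the cloud"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'the': 3, 'cloud': 2, 'cluster': 1, 'and': 1}
```

In Hadoop, each phase runs distributed across the cluster and the map and reduce tasks operate on data stored in HDFS, but the programming model is exactly this key-value pipeline.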

Hadoop Framework
- Hadoop Distributed File System (HDFS): fault-tolerant distributed file system that turns a cluster of servers into a scalable pool of storage
- Hadoop YARN: open-source platform for resource management and job scheduling in clusters
- Hadoop MapReduce: batch-processing tool for big data
- Higher-level languages over Hadoop: Pig and Hive

Hadoop MapReduce vs. Spark
- Hadoop MapReduce: involves a lot of data I/O on the hard disk after each map or reduce step; can handle data that fits on disk
- Spark: performs in-memory processing of the data; can handle data that fits in memory
https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
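To make the disk-versus-memory contrast concrete, here is a toy sketch (illustrative only, not the actual Hadoop or Spark implementations): the MapReduce-style pipeline writes each stage's intermediate result to a temporary file and reads it back, while the Spark-style pipeline chains lazy, in-memory transformations with no disk I/O between stages.

```python
import json
import tempfile

data = list(range(10))

# MapReduce-style: each stage persists its output to disk, and the
# next stage must read it back in before doing any work.
def stage_to_disk(values, func):
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump([func(v) for v in values], f)
        path = f.name
    with open(path) as f:
        return json.load(f)

doubled = stage_to_disk(data, lambda x: x * 2)
result_disk = stage_to_disk(doubled, lambda x: x + 1)

# Spark-style: transformations are chained as lazy generators and
# stay in memory; nothing touches the disk between stages.
doubled_mem = (x * 2 for x in data)
result_mem = [x + 1 for x in doubled_mem]

print(result_disk == result_mem)  # True: same answer, very different I/O cost
```

On real big-data workloads with many chained stages, avoiding those intermediate reads and writes is the main reason Spark can be much faster than classic MapReduce, at the price of needing enough memory to hold the working set.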

Do NOT use MapReduce if...
Keep in mind that MapReduce is designed for big data, so if your data is not THAT big:
- If your data is ~10 GB, your laptop likely has enough RAM to handle all of it
- If your data is ~500 GB-1 TB, an external hard drive plus some SQL should handle it nicely
Also keep in mind that MapReduce is great for key-value pairs, and it will make your life miserable if:
- Your computation depends on previously computed values
- Your algorithm depends on shared global state
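As an illustrative sketch of the first warning above, a running balance is a computation where each value depends on the one computed just before it; the loop below cannot be split into independent map tasks, which is exactly the kind of sequential dependency that makes MapReduce a poor fit.

```python
def running_balance(transactions, start=0.0):
    # Each balance depends on the previous total, so the iterations
    # form a chain and cannot be distributed as independent map tasks.
    balances = []
    total = start
    for amount in transactions:
        total += amount
        balances.append(total)
    return balances

print(running_balance([100.0, -30.0, 50.0]))  # [100.0, 70.0, 120.0]
```

Contrast this with word count, where every input record can be mapped on its own: here, reordering or parallelizing the iterations would change the intermediate balances, so the computation is inherently sequential.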