High Performance and Cloud Computing (HPCC) for Bioinformatics
King Jordan, Georgia Tech
January 13, 2016
Adapted from BIOS-ICGEB HPCC for Bioinformatics
Outline
- High performance computing (HPC)
- Cloud computing
- HPC vs. cloud computing
- Cloud computing for bioinformatics
HPC Overview: Client-server architecture
HPC Overview: Supercomputer clusters
A computer cluster is a single logical unit consisting of multiple computers linked through a local area network (LAN). The networked computers essentially act as a single, much more powerful machine. A computer cluster provides much faster processing speed, larger storage capacity, better data integrity, superior reliability, and wider availability of resources.
Computer clusters are, however, much more costly to implement and maintain, resulting in much higher running overhead compared to a single computer. (This is where cloud computing comes in.)
http://www.techopedia.com/definition/6581/computer-cluster
HPC Overview: Parallel computing
Parallel computing is a type of computing architecture in which several processors execute or process an application or computation simultaneously. Parallel computing performs large computations by dividing the workload among multiple processors, all of which work through the computation at the same time (a minimal sketch follows below). Most supercomputers employ parallel computing principles to operate. Parallel computing is also known as parallel processing.
http://www.techopedia.com/definition/8777/parallel-computing
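To make the divide-the-workload idea concrete, here is a minimal sketch of data parallelism in Python using the standard multiprocessing module; the GC-counting task and the example sequences are invented for illustration, not taken from the slides.

    from multiprocessing import Pool

    def gc_count(seq):
        """Count G and C bases in one sequence (the per-worker unit of work)."""
        return sum(base in "GC" for base in seq)

    if __name__ == "__main__":
        # Hypothetical workload: a list of DNA sequences to process.
        sequences = ["ATGCGC", "TTTTAA", "GGGCCC", "ATATGC"]
        # Divide the workload among 4 worker processes; each worker handles
        # a subset of sequences simultaneously, then results are combined.
        with Pool(processes=4) as pool:
            counts = pool.map(gc_count, sequences)
        print(sum(counts))  # total GC bases across all sequences

The same divide-and-combine pattern scales from a multicore laptop to a supercomputer, where frameworks such as MPI coordinate workers across many machines.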
HPC @ GA Tech: PACE (Partnership for an Advanced Computing Environment)
- 1,200 nodes with 30,000 CPU cores
- 90 terabytes of memory
- 2 petabytes of online commodity storage
- 215 terabytes of high-performance scratch storage
What is cloud computing? How is it related to HPC? How does it differ from traditional HPC?
What is Cloud Computing? (a skeptical view)
"The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do. I can't think of anything that isn't cloud computing with all of these announcements. The computer industry is the only industry that is more fashion-driven than women's fashion. Maybe I'm an idiot, but I have no idea what anyone is talking about. What is it? It's complete gibberish. It's insane. When is this idiocy going to stop?"
Larry Ellison, CEO of Oracle, Oracle OpenWorld 2008
https://www.youtube.com/watch?v=0facyai6dy0
Moving towards a more specific definition of Cloud Computing
In 2011 the National Institute of Standards and Technology (NIST) issued Special Publication 800-145, "The NIST Definition of Cloud Computing."
It is intended as a means for broad comparisons of cloud services and deployment strategies, and to provide a baseline for discussion of what cloud computing is and how it is used.
It defines the following categories of concepts:
- Essential characteristics
- Service models
- Deployment models
Essential characteristics of cloud computing (NIST)
- On-demand self-service
- Broad network access
- Resource pooling
- Rapid elasticity
- Measured service
Service models of Cloud Computing (NIST)
- Software as a Service (SaaS): the capability to use the provider's applications remotely over the network. The user does not manage the servers, operating system, storage, or even individual application capabilities.
- Platform as a Service (PaaS): the capability to deploy and use user-created or acquired applications on infrastructure made available by the provider. The user has control over deployed applications and their configuration, but does not manage servers, operating system, or storage.
- Infrastructure as a Service (IaaS): the capability to provision computing, storage, and networking resources on which to deploy arbitrary software. The user has virtual control over all resources, but does not have control over the physical infrastructure (see the provisioning sketch after this list).
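As a concrete illustration of IaaS, here is a minimal provisioning sketch using boto3, the AWS SDK for Python; it assumes configured AWS credentials, and the AMI ID and key-pair name below are placeholders, not real values.

    import boto3

    # Connect to the EC2 service in one region.
    ec2 = boto3.resource("ec2", region_name="us-east-1")

    # Provision one small virtual server ("instance") from a machine image.
    instances = ec2.create_instances(
        ImageId="ami-00000000",    # placeholder machine image ID
        InstanceType="t2.micro",   # small general-purpose instance type
        KeyName="my-key-pair",     # placeholder SSH key pair for remote login
        MinCount=1,
        MaxCount=1,
    )
    print(instances[0].id)  # instance ID used later to stop or terminate it

Note how this mirrors the physical workflow: the user controls what runs on the instance, while the provider owns and operates the underlying hardware.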
Deployment models of Cloud Computing (NIST)
- Private cloud
- Community cloud
- Public cloud
- Hybrid cloud
Cloud Computing can also be considered a kind of Commodity Computing
- Use of large numbers of already-available computing components for parallel computing, to get the greatest amount of useful computation at low cost
- Computing done on commodity computers, as opposed to high-cost supercomputers or boutique computers
- Commodity computers are computer systems manufactured by multiple vendors, incorporating components based on open standards
- Such systems are said to be based on commodity components, since the standardization process promotes lower costs and less differentiation among vendors' products
http://en.wikipedia.org/wiki/Commodity_computing
Cloud Computing was made possible by the convergence of three existing technologies
- The internet: research on packet networking funded in the 1960s; TCP/IP introduced in the 1980s; opening to commercial traffic 1990-1995
- Virtualization: early work by IBM in the 1960s; hardware virtualization becomes mainstream in the early 2000s
- Parallel computing: first multiprocessor computers in the 1960s; birth of the Message Passing Interface (MPI) in 1992; MapReduce paper published in 2004
HPC versus Cloud Computing Models
Traditional HPC model (physical data center):
- Buy a bunch of server boxes
- Add hard drives for storage
- Connect servers with cables into an intranet
- Install an operating system and applications
- Log in remotely and start working: ssh user@mydomain.com
Cloud Computing model (virtual data center):
- Provision a bunch of instances
- Attach virtual volumes for storage
- Create a virtual private cloud
- Launch a machine image
- Log in remotely and start working: ssh user@mydomain.com
Cloud computing: Available platforms
- Amazon Web Services - http://aws.amazon.com/
- Microsoft Azure - http://azure.microsoft.com/en-us/
- Google App Engine - https://cloud.google.com/appengine/
- Illumina BaseSpace - https://basespace.illumina.com
- IBM Cloud Computing - http://www.ibm.com/cloud-computing/us/en/
- HP Eucalyptus - https://www.eucalyptus.com/
- HP Cloud - http://www.hpcloud.com/
- Rackspace Cloud - http://www.rackspace.com/cloud
- DigitalOcean - https://www.digitalocean.com/
- CenturyLink Cloud - https://www.centurylinkcloud.com/
- Verizon Cloud - http://cloud.verizon.com/
- Computer Sciences Corporation - http://www.csc.com/cloud
- Virtustream - http://www.virtustream.com/
- VMware - http://www.vmware.com/cloud-services/
- Fujitsu Cloud - http://www.fujitsu.com/global/solutions/cloud/
- Dimension Data Cloud - http://cloud.dimensiondata.com/am/en/
- GoGrid - http://www.gogrid.com/
- Joyent - https://www.joyent.com/
Cloud computing: Performance comparison
[Figure: Gartner Magic Quadrant for Cloud IaaS, 2014; vendors are plotted by completeness of vision (x-axis) against ability to execute (y-axis)]
Cloud computing for bioinformatics
- Basics and the need for cloud computing
- Barriers to use
- Widely used platforms: Amazon Web Services, Microsoft Azure, Bionimbus, Galaxy, Google, Illumina BaseSpace, ADAM
ADAM is a genomics analysis platform developed in the Apache Spark ecosystem. It uses the in-memory cluster computing functionality of Apache Spark, ensuring efficient and fault-tolerant distribution based on data parallelism, without the intermediate disk operations required in classical distributed approaches (a minimal loading sketch follows below).
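A hedged sketch of loading aligned reads through ADAM's Python bindings (the bdgenomics.adam package); the BAM path is a placeholder, and the class and method names follow the ADAM documentation as best recalled, so verify them against the version you install.

    from pyspark.sql import SparkSession
    from bdgenomics.adam.adamContext import ADAMContext  # assumes bdgenomics.adam is installed

    spark = SparkSession.builder.appName("adam-demo").getOrCreate()
    ac = ADAMContext(spark)

    # Load a BAM file as a distributed dataset of alignment records,
    # partitioned across the cluster and held in memory by Spark.
    reads = ac.loadAlignments("hdfs:///data/sample.bam")  # placeholder path
    df = reads.toDF()   # expose the alignments as a Spark DataFrame
    print(df.count())   # number of alignment records
    spark.stop()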
MapReduce Framework with Hadoop
http://hadoop.apache.org
[More from Ahsan Huda]
Hadoop Framework
- Hadoop Distributed File System (HDFS): fault-tolerant distributed file system that uses a cluster of servers as a scalable pool of storage
- Hadoop YARN: open-source resource management platform for computing resource allocation and scheduling in clusters
- Hadoop MapReduce: batch-processing tool for big data (see the word-count sketch after this list)
- Higher-level languages over Hadoop: Pig and Hive
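As a concrete illustration, here is the classic word count written for Hadoop Streaming in Python. The mapper reads lines from stdin and emits one tab-separated (word, 1) pair per word:

    # mapper.py
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

Hadoop sorts the mapper output by key, so the reducer sees all counts for a given word consecutively and sums them into one total per word:

    # reducer.py
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The job is submitted with a command along the lines of: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the streaming jar location varies by installation, and the paths here are placeholders).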
Hadoop MapReduce vs. Spark
Hadoop MapReduce:
- Involves a lot of data I/O to the hard disk after each map or reduce action
- Can handle data that fits on the hard disk
Spark:
- Performs in-memory processing of the data (see the sketch below)
- Can handle data that fits in memory
https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
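For comparison, here is the same word count in PySpark, which keeps intermediate results in memory; the input path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/reads.txt")  # placeholder input path
    counts = (lines.flatMap(lambda line: line.split())  # split lines into words
                   .map(lambda word: (word, 1))         # emit (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b)     # sum counts per word
                   .cache())                            # keep the result in memory for reuse
    print(counts.take(5))  # sample of (word, count) pairs
    spark.stop()

Unlike Hadoop MapReduce, no intermediate results are written to disk between the map and reduce stages, which is why Spark is typically much faster for iterative workloads.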
Do NOT use MapReduce if...
Keep in mind that MapReduce is designed for big data, so do not use it if your data is not THAT big:
- If your data is ~10 GB, your laptop likely has enough RAM to handle all of it
- If your data is ~500 GB-1 TB, an external hard drive plus some SQL should handle it nicely
Also keep in mind that MapReduce is great for key-value pairs, and it will make your life miserable if you try to use it when:
- Your computation depends on previously computed values
- Your algorithm depends on shared global state