XLDB 11 Cloud Computing at Scale. Roger Barga Microsoft Research

Similar documents
escience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows

escience in the Cloud Dan Fay Director Earth, Energy and Environment



Scaling Science to the Cloud. Tony Hey Corporate Vice President Microsoft Corporation

Example #1: Gene Sequencing

Chart 2: e-waste Processed by SRD Program in Unregulated States

Performing Large Science Experiments on Azure: Pitfalls and Solutions

US STATE CONNECTIVITY

Manufactured Home Production by Product Mix ( )

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING

The CEDA Archive: Data, Services and Infrastructure

Commercial Data Intensive Cloud Computing Architecture: A Decision Support Framework

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING

IACMI - The Composites Institute

Project Kickoff CS/EE 217. GPU Architecture and Parallel Programming

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING

Introduction to Windows Azure Cloud Computing Futures Group, Microsoft Research Roger Barga, Jared Jackson, Nelson Araujo, Dennis Gannon, Wei Lu, and

Alaska no no all drivers primary. Arizona no no no not applicable. primary: texting by all drivers but younger than

Introduction to Grid Computing

Terry McAuliffe-VA. Scott Walker-WI

The Promise of Brown v. Board Not Yet Realized The Economic Necessity to Deliver on the Promise

Programmable Information Highway (with no Traffic Jams)

Automatic Generation of Workflow Provenance

4/25/2013. Bevan Erickson VP, Marketing

Reporting Child Abuse Numbers by State

Oklahoma Economic Outlook 2016

IBM Power Systems: Open innovation to put data to work Dexter Henderson Vice President IBM Power Systems

Conference The Data Challenges of the LHC. Reda Tafirout, TRIUMF

New Approach to Unstructured Data

Image Processing on the Cloud. Outline

Evaluation Report: HP StoreFabric SN1000E 16Gb Fibre Channel HBA

Data Intensive Science Impact on Networks

Optimizing Apache Spark with Memory1. July Page 1 of 14

MapMarker Plus 10.2 Release Notes

Question by: Scott Primeau. Date: 20 December User Accounts 2010 Dec 20. Is an account unique to a business record or to a filer?

ICN for Cloud Networking. Lotfi Benmohamed Advanced Network Technologies Division NIST Information Technology Laboratory

Discover the all-flash storage company for the on-demand world

What's Next for Clean Water Act Jurisdiction

The National Center for Genome Analysis Support as a Model Virtual Resource for Biologists

The National Fusion Collaboratory

Wireless Network Data Speeds Improve but Not Incidence of Data Problems, J.D. Power Finds

Florida PALM Project Update Agenda

T2C2. A timely and trusted curator and coordinator of scientific data

MapMarker Standard 10.0 Release Notes

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Power-Aware Throughput Control for Database Management Systems

Knowledge Discovery in Data Bases

The Impact of SSD Selection on SQL Server Performance. Solution Brief. Understanding the differences in NVMe and SATA SSD throughput

Zhengyang Liu University of Virginia. Oct 29, 2012

Oklahoma Economic Outlook 2015

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

IBM Emulex 16Gb Fibre Channel HBA Evaluation

Data publication and discovery with Globus

How Employers Use E-Response Date: April 26th, 2016 Version: 6.51

Arizona does not currently have this ability, nor is it part of the new system in development.

Clouds: An Opportunity for Scientific Applications?

Reflections on Three Decades in Internet Time

Forget about the Clouds, Shoot for the MOON

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

Alaska ATU 1 $13.85 $4.27 $ $ Tandem Switching $ Termination

DDN Annual High Performance Computing Trends Survey Reveals Rising Deployment of Flash Tiers & Private/Hybrid Clouds vs.

SAS Visual Analytics 8.1: Getting Started with Analytical Models

A Perspective on MGI Networking

ACCELERATE YOUR ANALYTICS GAME WITH ORACLE SOLUTIONS ON PURE STORAGE

Improving Network Infrastructure to Enable Large Scale Scientific Data Flows and Collaboration (Award # ) Klara Jelinkova Joseph Ghobrial

Big Data: Tremendous challenges, great solutions

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

RESEARCH DATA DEPOT AT PURDUE UNIVERSITY

Building Effective CyberGIS: FutureGrid. Marlon Pierce, Geoffrey Fox Indiana University

Benefits of full TCP/IP offload in the NFS

Crop Progress. Corn Emerged - Selected States [These 18 States planted 92% of the 2016 corn acreage]

U.S. Residential High Speed Internet

CS Project Report

Alaska ATU 1 $13.85 $4.27 $ $ Tandem Switching $ Termination

Distributed File Systems Part IV. Hierarchical Mass Storage Systems

Software + Services for Data Storage, Management, Discovery, and Re-Use

Massively Parallel Processing. Big Data Really Fast. A Proven In-Memory Analytical Processing Platform for Big Data

LAB #6: DATA HANDING AND MANIPULATION

2013 AWS Worldwide Public Sector Summit Washington, D.C.

AGILE BUSINESS MEDIA, LLC 500 E. Washington St. Established 2002 North Attleboro, MA Issues Per Year: 12 (412)

GPFS for Life Sciences at NERSC

Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications

On BigFix Performance: Disk is King. How to get your infrastructure right the first time! Case Study: IBM Cloud Development - WW IT Services

Galaxy. Daniel Blankenberg The Galaxy Team

Data Centres in the Virtual Observatory Age

The DataBridge: A Social Network for Long Tail Science Data!

Advanced LabVIEW for FTC

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Hadoop/MapReduce Computing Paradigm

MOHA: Many-Task Computing Framework on Hadoop

ADVANCED IN-MEMORY COMPUTING USING SUPERMICRO MEMX SOLUTION

Infrastructure Innovation Opportunities Y Combinator 2013

Levels of Measurement. Data classing principles and methods. Nominal. Ordinal. Interval. Ratio. Nominal: Categorical measure [e.g.

Genomics on Cisco Metacloud + SwiftStack

Cisco Tetration Analytics

Tuning I/O Performance for Data Intensive Computing. Nicholas J. Wright. lbl.gov

Crop Progress. Corn Dough Selected States [These 18 States planted 92% of the 2017 corn acreage] Corn Dented Selected States ISSN:

LONI Update. ULS CIO Meeting. Lonnie Leger, LONI / LSU

BEST BIG DATA CERTIFICATIONS

Transcription:

XLDB 11 Cloud Computing at Scale Roger Barga Microsoft Research

Framing Questions for Presentation(s) Does it make sense for large-scale (many terabytes, petabytes), data-intensive projects to consider using public cloud? Many of the claims being made about cloud computing for research have lead some to the point of irrational exuberance and unrealistic expectations Issues and observations on when public clouds are not a great fit What is the reasonable scale, and how to decide when it is better to use a public cloud or a cluster? Realistic expectations and opportunities to create sources of value. 2

China Selected Projects Penn Louisiana Washington New York New Mexico North Dakota California Colorado Michigan Texas National Science Foundation Florida Georgia Mass. Virginia North Carolina South Carolina Indiana Delaware Taiwan- starting

Economics of Cloud Computing It is hard to compete with a fully utilized cluster on cost alone: Cost comparison for DOE lab performing metagenomics, assuming cluster utilization of 80% or higher, private cluster (container) was a clear winner; Possible to fine tune a cluster for a specific workload, further increasing the amount of work performed per dollar spent. Most organizations do not have to fully account for the cost of running a cluster or compute facility, with the institution carrying cost for infrastructure, electricity, etc 163,840 cores ANL ~130,000 cores LANL 150,152 cores ORNL

Economics of Cloud Storage $44.56 $1,250 Hard Drive Storage (per gigabyte) Web Storage (per gigabyte) 2000 Cost prohibitive for petascale projects (and likely to remain so for the foreseeable future )

Economics of Cloud Storage BackBlaze Blog Sept 2009 $0.07 $0.15 Hard Drive Storage (per gigabyte) Web Storage (per gigabyte) 2010 Cost prohibitive for petascale projects (and likely to remain so for the foreseeable future )

Realities of Networking both to and within Cloud Platforms Latency and throughput from lab to cloud (datacenter) limited by weakest link. Measure the time to transfer a terabyte, even on research networks (NLR and/or I2); Groups facing issues scaling a compute facility unable to get sufficient throughput; Limited network bandwidth within the datacenter, largely commodity hardware Simply put, data intensive research applications were not the primary target customer for first generation cloud providers

Lack of Broad Access how the other half live High Performance Data-intensive Capacity 1M 14M 80% 20% 55M 70M Little to no access to high performance data-intensive capacity Scientists & Engineers Actually the other 80%, as only a fraction of scientists and researchers enjoy regular access to required computational and data-intensive resources

What Researchers Need Today, and Trends Looking Ahead Data Analysis Services I ve got data, what can I do with it? Most researchers struggle with 50GB to 100GB for their research; Analysis can be compute intensive, intermediate data products large; Research projects are more diverse, more collaborative Push towards community reference data collections, up to 5TB to 10TB; Shared, community built and supported research services; Software services are becoming a new type of instrument (A Szalay) Analysis involves personal research data plus community reference data; Ability to curate, share, and collaborate over data is game changing; A common pattern across many scientific disciplines

R. palustris as a platform for H2 production * Identify key drivers for producing hydrogen, promising alternative fuel understand R. palustris well enough to be able to improve its H2 production; Characterize a population of strains and use integrative genomics approaches to dissect the molecular networks of H2 production; BLAST to query 16 strains to sort out genetic relationships Each strain, estimated ~5,000 proteins (700,000 sequence) Jobs kicked off of NCBI clusters before completion Using AzureBLAST against NCBI NREF complete in ~30 min. Against ~5,000 proteins from another strain less than 30 sec. Publishable result in one day for roughly $150. NCBI reference data for complete BLAST suite, roughly 50GB Eric Schadt, Pac Bio and Sam Phattarasukol Harwood Lab, UW

AzureMODIS Computing Evapotranspiration (ETP) Catharine van Ingen (MSR), Jie Li, Marty Humphrey (UVA), Youngryel Ryu (UCB), Deb Agarwal (BWC/LBL), Keith Jackson (BL), Jay Borenstein (Stanford), Team SICT: Vlad Andrei, Klaus Ganser, Samir Selman, Nandita Prabhu (Stanford), Team Nimbus: David Li, Sudarshan Rangarajan, Shantanu Kurhekar, Riddhi Mittal (Stanford) Two MODIS satellites Terra, launched 12/1999 Aqua, launched 05/2002 Near polar orbits Global coverage two days Sensitive in 36 spectral bands

Computing Evapotranspiration for One US Year Source Imagery Download Sites... Request Queue $50 upload $225 storage Download Queue Data Collection Stage 400-500 GB 60K files 10 MB/sec 11 hours <10 workers Reprojection Queue Source Metadata External $ 1200 Internal $ 600 Scientific Results Download Reprojection Stage $420 cpu $60 download 400 GB 45K files 3500 hours Reduction #1 Queue Derivation Reduction Stage 5-7 GB 5.5K files $216 cpu 1800 hours 100 workers $216 cpu Reduction #2 Queue Analysis Reduction Stage <10 GB ~15K files 1800 hours Total cost $1,800, where all storage costs assume 3 month project duration;

Excel DataScope Cloud Scale Data Analytics from Excel Bringing the power of the cloud to the laptop Data sharing in the cloud, with annotations to facilitate discovery and reuse; Sample and manipulate extremely large data collections in the cloud; Data analytics algorithms, through Excel ribbon running on Azure; Invoke models, to process and refine data; Machine learning over large data sets to discover correlations; Publish data collections and visualizations to the cloud; Researchers use familiar tools, familiar but differentiated.

The Cloud Opportunity: Democratizing Access to Resources