Using the Open Science Data Cloud for Data Science Research. Robert Grossman, University of Chicago / Open Cloud Consortium. June 17, 2013


Discoveries: Team (you and your colleagues) + correlation + algorithms + Instrument (3,000 cores / 5 PB OSDC science cloud) + Data (1 PB of OSDC data across several disciplines)

Part 1: What Instrument Do We Use to Make Big Data Discoveries? How do we build a datascope?

What is big data? W? kW? MW? TB? PB? EB?

An algorithm and its computing infrastructure are big-data scalable if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time, but over more data.
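
The "same computation, same time, more data" property is weak scaling; a minimal sketch of the arithmetic is below, with hypothetical per-rack capacity and throughput figures.

```python
# Back-of-the-envelope illustration of "big-data scalable" (weak scaling):
# adding a rack adds both data and processors, so the elapsed time stays flat.

def elapsed_hours(racks, data_per_rack_tb=500, throughput_per_rack_tb_per_hr=50):
    """Time to scan all the data when each rack processes its own share."""
    total_data = racks * data_per_rack_tb
    total_throughput = racks * throughput_per_rack_tb_per_hr
    return total_data / total_throughput

for racks in (1, 2, 4, 8):
    print(racks, "rack(s):", elapsed_hours(racks), "hours over", racks * 500, "TB")
# Same elapsed time (10 hours) each run, but over proportionally more data.
```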

Commercial Cloud Service Provider (CSP): a 15 MW commercial cloud data center.
- 100,000 servers, 1 PB of DRAM, 100s of PB of disk
- Monitoring, network security and forensics
- Automatic provisioning and infrastructure management
- Accounting and billing
- Customer facing portal
- Data center network with ~1 Tbps egress bandwidth
- 25 operators for 15 MW

OSDC's vote for a datascope: a (boutique) data-center-scale facility with a big-data scalable analytic infrastructure.

Discoveries: Team (you and your colleagues) + correlation + algorithms + Instrument (3,000 cores / 5 PB OSDC science cloud) + Data (1 PB of OSDC data across several disciplines)

Some Examples of Big Data Science (discipline, duration, size, number of devices):
- HEP - LHC: 10 years, 15 PB/year*, one device
- Astronomy - LSST: 10 years, 12 PB/year**, one device
- Genomics - NGS: 2-4 years, 0.5 TB/genome, 1000s of devices

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/spotlight/spotlightgrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
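
A quick sanity check of the figures quoted above; this sketch only restates the slide's numbers.

```python
# LHC: "more than 15 million gigabytes of data each year" is about 15 PB/year.
lhc_pb_per_year = 15_000_000 / 1_000_000        # GB -> PB
print("LHC:", lhc_pb_per_year, "PB/year")

# LSST: ~30 TB of processed data per night over a 10-year survey.
lsst_pb_per_year = 30 * 365 / 1000              # TB/night -> PB/year
print("LSST:", round(lsst_pb_per_year, 1), "PB/year")   # ~11, close to the 12 PB/year above
print("LSST 10-year total:", round(10 * lsst_pb_per_year), "PB")
```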

One large instrument; many smaller instruments.

Part 2: What is a Cloud and Why Do We Care?

There Are Two Essential Characteristics of a Cloud: 1. self service and 2. scale. Clouds enable you to compute over large amounts of data without the necessity of first downloading the data. Clouds can be designed to be secure and compliant.

Self Service

Scale

Types of Clouds:
- Public clouds: Amazon
- Private clouds: run internally by universities or companies
- Community clouds: run by organizations (either formally or informally), such as the Open Cloud Consortium

vs. Amazon Web Services (AWS)?
- AWS: scale; the simplicity of a credit card; a wide variety of offerings.
- Community clouds, science clouds, etc.: lower cost (at medium scale); data too important for a commercial cloud; computing over scientific data is a core competency; can support any required governance / security.
OCC supports AWS interop and bursting when permissible.

NFP Science Clouds vs. Commercial Clouds (science cloud | commercial cloud):
- POV: democratize access to data, integrate data to make discoveries, long term archive | as long as you pay the bill, as long as the business model holds
- Data & storage: data intensive computing & high performance storage | Internet-style scale-out and object-based storage
- Flows: large & small data flows | lots of small web flows
- Streams: streaming processing required | NA
- Accounting: essential | essential
- Lock in: moving environments between CSPs is essential | lock in is good
- Interop: critical, but difficult | customers will drive it to some degree

Essential Services for a Science CSP:
- Support for data intensive computing
- Support for big data flows
- Account management, authentication and authorization services
- Health and status monitoring
- Billing and accounting
- Ability to rapidly provision infrastructure
- Security services, logging, event reporting
- Access to large amounts of public data
- High performance storage
- Simple data export and import services

[Diagram: data scientist, datascope, and Sci CSP services provided by a Science Cloud Service Provider (Sci CSP)]

Cloud Services Operations Centers (CSOC). The OSDC operates a Cloud Services Operations Center (or CSOC). It is a CSOC focused on supporting science clouds for researchers. Compare to a Network Operations Center, or NOC. Both are an important part of the cyberinfrastructure for big data science.

[Diagram: data scientist, datascope, Sci CSP services provided by a Science Cloud Service Provider (Sci CSP), and a Cloud Service Operations Center (CSOC)]

Part 3: Data Science

Establish best practices and strategies for data science in general and discipline-specific data science in particular:
- Models and algorithms
- Data
- General and discipline-specific software applications and tools
- Data analytic infrastructure
- Foundations of data science

What are the foundations for data science?

Theory to Big Data Spectrum (from no data through small data (GB) and medium data (TB) to big data (PB)):
- Mathematical theorems
- Traditional statistical modeling
- (Semi-)automating statistical modeling
- Simple counts and statistics over big data
The OSDC Datascope (0.5-2.0 MW) sits at the big data end of this spectrum.

Part 4: The Open Science Data Cloud (www.opensciencedatacloud.org)

Discoveries: Team (you and your colleagues) + correlation + algorithms + Instrument (3,000 cores / 5 PB OSDC science cloud) + Data (1 PB of OSDC data across several disciplines)

2013 Open Science Data Cloud (IaaS): a 0.5 MW science cloud.
- Science cloud software & services, 5 PB in 2013 (OpenStack & GlusterFS)
- Compliance & security (OpenFISMA)
- Infrastructure automation & management: Yates (v0.1)
- Accounting & billing (Salesforce.com)
- Customer facing portal & middleware: Tukey (v0.3)
- Virtual machines (VMs) containing common applications & pipelines
- Science cloud data center network, ~10-100 Gbps bandwidth
- 5 engineers to operate
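
To make the self-service model concrete, here is a minimal sketch of booting an analysis VM on an OpenStack-based cloud such as the OSDC, assuming the python-novaclient library of that era; the credentials, endpoint, image name, and flavor name are all hypothetical placeholders rather than actual OSDC values.

```python
# Hypothetical example: launch a VM containing common applications & pipelines
# on an OpenStack cloud (all names and the endpoint below are placeholders).
from novaclient import client

nova = client.Client("2", "alice", "SECRET", "osdc-project",
                     auth_url="https://cloud.example.org:5000/v2.0")

image = nova.images.find(name="osdc-ubuntu-pipelines")   # VM image with pipelines pre-installed
flavor = nova.flavors.find(name="m1.large")              # CPU / RAM size of the instance

server = nova.servers.create(name="my-analysis-vm", image=image, flavor=flavor)
print("Launched instance", server.id)
```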

Tukey (based in part on Horizon): we have factored out the digital ID service, file sharing, and transport from Bionimbus and Matsu.

Yates: automated installation of the OSDC software stack on a rack of computers. Based upon Chef. Version 0.1.

UDR: UDT is a high performance network transport protocol, and UDR = rsync + UDT. It makes it easy for an average systems administrator to keep 100s of TB of distributed data synchronized. We are using it to distribute c. 1 PB from the OSDC.
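
A sketch of how a mirror site might drive UDR from a script, assuming the usual "udr rsync ..." command-line form; the host and paths are hypothetical.

```python
# Hypothetical wrapper around UDR, which carries rsync traffic over the UDT protocol.
import subprocess

def sync_dataset(src, dest):
    """Mirror a large directory tree with rsync running over UDT."""
    subprocess.check_call(["udr", "rsync", "-av", "--stats", src, dest])

# e.g. keep a local mirror of a public OSDC dataset up to date (placeholder paths):
# sync_dataset("user@mirror.example.org:/data/earth-science/", "/local/mirror/earth-science/")
```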

Open Science Data Cloud Services:
- Digital ID services
- Data sharing services
- Data transport services (UDR)
What other core services are essential? Of course, working groups and applications always add their own services. These core services will hopefully make the OSDC attractive as a platform (PaaS) for scientific discovery.

The Open Cloud Consortium is a U.S.-based not-for-profit corporation. It manages cloud computing infrastructure to support scientific research (the Open Science Data Cloud), cloud computing infrastructure to support medical and health care research (the Biomedical Commons Cloud), and cloud computing testbeds (the Open Cloud Testbed). www.opencloudconsortium.org

OCC Members & Partners:
- Companies: Cisco, Yahoo!, Intel
- Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago
- Federal agencies and labs: NASA
- International partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam
- Partners: National Lambda Rail

Tukey + Yates + third-party open source software + open source software developed by the OCC and open standards + data center + data with permissions + authorization of users' access to data + policies, procedures, controls, etc. + governance, legal agreements + sustainability model.

Part 5: OSDC Data

Discoveries: Team (you and your colleagues) + correlation + algorithms + Instrument (3,000 cores / 5 PB OSDC science cloud) + Data (1 PB of OSDC data across several disciplines)

OSDC Public Data Sets: over 800 TB of open access data in the OSDC.
- Earth sciences data
- Biological sciences data
- Social sciences data
- Digital humanities data

Part 6: OSDC Working Groups. Just look around you.

Matsu Working Group: Clouds to Support Earth Science. matsu.opensciencedatacloud.org

Matsu Architecture:
- Presentation services: Matsu Web Map Tile Service (WMTS), serving images at different zoom layers suitable for an OGC Web Mapping Server; Web Coverage Processing Service (WCPS)
- Analytic services: NoSQL-based analytic services, MapReduce-based analytic services, streaming analytic services
- Workflow services: Matsu MapReduce-based tiling service; MapReduce is used to process Level n to Level n+1 data and to partition images for different zoom levels
- Storage: a NoSQL database for WMS tiles and derived data products; Hadoop HDFS for Level 0, Level 1 and Level 2 images
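
The tiling step can be pictured as a Hadoop Streaming job: the mapper assigns each image tile at zoom level n+1 to its parent tile at level n, and the reducer gathers the (up to four) children of each parent so they can be composited. This is a simplified sketch, not the Matsu code; the input format (one "zoom/x/y<TAB>tile-path" record per line) is assumed for illustration.

```python
# tiles.py -- Hadoop Streaming-style sketch: roll tiles up one zoom level.
import sys

def mapper(lines):
    # Emit (parent tile at zoom z-1, child tile path) for every input tile.
    for line in lines:
        key, tile_path = line.rstrip("\n").split("\t")
        z, x, y = (int(v) for v in key.split("/"))
        parent = "%d/%d/%d" % (z - 1, x // 2, y // 2)
        print("%s\t%s" % (parent, tile_path))

def reducer(lines):
    # Collect the children of each parent tile; compositing would happen here.
    current, children = None, []
    for line in lines:
        parent, tile_path = line.rstrip("\n").split("\t")
        if parent != current and current is not None:
            print("%s\t%s" % (current, ",".join(children)))
            children = []
        current = parent
        children.append(tile_path)
    if current is not None:
        print("%s\t%s" % (current, ",".join(children)))

if __name__ == "__main__":
    # Hadoop Streaming runs the mapper and reducer as separate processes;
    # here a command-line flag selects the role: python tiles.py map|reduce
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```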

Hadoop-Based Re-Analysis. Zoom level 1: 4 images; zoom level 2: 16 images; zoom level 3: 64 images; zoom level 4: 256 images (4^n images at zoom level n).

Bionimbus Working Group bionimbus.opensciencedatacloud.org (biological data)

Bionimbus Protected Data Cloud

Analyzing Data From The Cancer Genome Atlas (TCGA).
Current practice:
1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate a secure, compliant computing environment to manage 10-100+ TB of data.
3. Get the environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CGHub (takes days to weeks).
6. Begin analysis.
With the Protected Data Cloud (PDC):
1. Apply to dbGaP for access to data.
2. Use your eRA Commons credentials to log in to the PDC, then select the data that you want to analyze and the pipelines that you want to use.
3. Begin analysis.

One Million Genomes. Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation. The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue). One million genomes is therefore about 1000 PB, or 1 EB. With compression, it may be about 100 PB. At $1000/genome, the sequencing would cost about $1B.
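
The arithmetic behind these estimates, as a quick back-of-the-envelope sketch (the roughly 10x compression ratio is implied by the slide's 1 EB to ~100 PB figures):

```python
# Back-of-the-envelope numbers for one million genomes.
genomes = 1_000_000
tb_per_genome = 1                          # tumor + normal samples for one patient

raw_pb = genomes * tb_per_genome / 1000    # 1,000,000 TB -> 1,000 PB (= 1 EB)
compressed_pb = raw_pb / 10                # ~100 PB, assuming roughly 10x compression
cost_usd = genomes * 1000                  # at $1000 per genome -> about $1B

print(raw_pb, "PB raw;", compressed_pb, "PB compressed; ~$%d sequencing cost" % cost_usd)
```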

Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis; improved understanding of genomic science; genomic-driven drug development; precision diagnosis and treatment; preventive health care.

Biomedical Commons Cloud (BCC) Working Group. Example: the Open Cloud Consortium's Biomedical Commons Cloud (BCC). [Diagram: Medical Research Centers A, B, and C and Hospital D share a cloud for public data, a cloud for controlled genomic data, and a cloud for EMR/PHI data.]

Resource / who uses it / who operates it:
- Open Science Data Cloud (OSDC): pan-science data for researchers; operated by the Open Cloud Consortium (OCC), supported by university OCC members.
- Biomedical Commons Cloud (BCC): (international) biomedical researchers; operated by the OCC Biomedical Commons Cloud Working Group, supported by OCC university members.
- Bionimbus Protected Data Cloud: genomics researchers; operated by the University of Chicago, supported by the OCC.

OpenFlow-Enabled Hadoop WG. When running Hadoop, some map and reduce jobs take significantly longer than others. These are stragglers, and they can significantly slow down a MapReduce computation. Stragglers are common (a dirty secret about Hadoop). Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers. We have a testbed for a wide area version of this project.
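
As an illustration of the straggler problem (not the working group's actual implementation), here is a sketch that flags tasks whose elapsed time is well above the median for their job; these are the candidates whose network flows would receive extra bandwidth. The task ids, times, and the 1.5x threshold are made up for the example.

```python
# Hypothetical straggler detection: flag tasks much slower than the median.
def find_stragglers(task_times, factor=1.5):
    """Return ids of tasks whose elapsed time exceeds factor x the median."""
    times = sorted(task_times.values())
    median = times[len(times) // 2]
    return [tid for tid, t in task_times.items() if t > factor * median]

# Elapsed seconds for the map tasks of one job (made-up numbers).
tasks = {"m_000": 110, "m_001": 95, "m_002": 480, "m_003": 102, "m_004": 310}
print(find_stragglers(tasks))   # ['m_002', 'm_004'] -> give these flows more bandwidth
```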

OSDC PIRE Project. We select OSDC PIRE Fellows (US citizens or permanent residents):
- We give them tutorials and training on big data science.
- We provide them fellowships to work with OSDC international partners.
- We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow. www.opensciencedatacloud.org (look for PIRE)

Part 7: Key Questions for This Workshop

Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?
Question 2. What data can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?
Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?
Question 4. What tools and applications can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?
Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?
Question 6. What are 3-5 grand challenge questions that leverage the OSDC?

Questions

Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics) called The Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.