Using the Open Science Data Cloud for Data Science Research. Robert Grossman, University of Chicago & Open Cloud Consortium. June 17, 2013
Discoveries: Team (you and your colleagues) + correlation + algorithms + Instrument (3000-core / 5 PB OSDC science cloud) + Data (1 PB of OSDC data across several disciplines)
Part 1. What Instrument Do We Use to Make Big Data Discoveries? How do we build a datascope?
What is big data? W? kW? MW? TB? PB? EB?
An algorithm and computing infrastructure is big-data scalable if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time, but over more data.
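This is essentially weak scaling. A minimal sketch of the idea, using a hypothetical cost model (the function name and throughput figure are illustrative, not OSDC numbers):

```python
# Weak-scaling sketch of the "big-data scalable" definition: adding a
# rack adds both data and processors, so an embarrassingly parallel,
# linear-time computation finishes in the same time over more data.

def run_time(data_tb: float, racks: int, tb_per_hour_per_rack: float = 100.0) -> float:
    """Hours to scan `data_tb` terabytes with `racks` racks, assuming
    no coordination overhead (the idealized case)."""
    return data_tb / (racks * tb_per_hour_per_rack)

t1 = run_time(data_tb=500, racks=1)    # one rack, 500 TB
t2 = run_time(data_tb=1000, racks=2)   # add a rack *and* its data
assert t1 == t2 == 5.0                 # same time, twice the data
```

In practice coordination and shuffle costs keep real systems below this ideal, which is why the definition is a property to design for rather than a given.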
Commercial Cloud Service Provider (CSP): a 15 MW data center. Monitoring, network security and forensics. Automatic provisioning and infrastructure management. Accounting and billing. 100,000 servers, 1 PB DRAM, 100s of PB of disk. Customer-facing portal. ~1 Tbps egress bandwidth. ~25 operators for 15 MW. Commercial cloud data center network.
The OSDC's vote for a datascope: a (boutique) data-center-scale facility with a big-data scalable analytic infrastructure.
Discoveries: Team (you and your colleagues) + correlation + algorithms + Instrument (3000-core / 5 PB OSDC science cloud) + Data (1 PB of OSDC data across several disciplines)
Some Examples of Big Data Science

| Discipline | Duration | Size | # Devices |
|---|---|---|---|
| HEP - LHC | 10 years | 15 PB/year* | One |
| Astronomy - LSST | 10 years | 12 PB/year** | One |
| Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000s |

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/spotlight/spotlightgrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
One large instrument Many smaller instruments
Part 2. What is a Cloud and Why Do We Care?
There Are Two Essential Characteristics of a Cloud: 1. Self service. 2. Scale. Clouds enable you to compute over large amounts of data without the necessity of first downloading the data. Clouds can be designed to be secure and compliant.
Self Service
Scale
Types of Clouds
- Public clouds: e.g., Amazon
- Private clouds: run internally by universities or companies
- Community clouds: run by organizations (either formally or informally), such as the Open Cloud Consortium
Science clouds vs. Amazon Web Services (AWS)?
AWS: scale; the simplicity of a credit card; a wide variety of offerings.
Science clouds: community clouds, science clouds, etc.; lower cost (at medium scale); data too important for a commercial cloud; computing over scientific data as a core competency; can support any required governance/security.
The OCC supports AWS interop and bursting when permissible.
| | Science Clouds (NFP) | Commercial Clouds |
|---|---|---|
| POV | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds. |
| Data & storage | Data intensive computing & HP storage | Internet-style scale-out and object-based storage |
| Flows | Large & small data flows | Lots of small web flows |
| Streams | Streaming processing required | NA |
| Accounting | Essential | Essential |
| Lock in | Moving environment between CSPs essential | Lock in is good |
| Interop | Critical, but difficult | Customers will drive to some degree |
Essential Services for a Science CSP
- Support for data intensive computing
- Support for big data flows
- Account management, authentication and authorization services
- Health and status monitoring
- Billing and accounting
- Ability to rapidly provision infrastructure
- Security services, logging, event reporting
- Access to large amounts of public data
- High performance storage
- Simple data export and import services
Sci CSP services | Data scientist | Datascope | Science Cloud Service Provider (Sci CSP)
Cloud Services Operations Centers (CSOC). The OSDC operates a Cloud Services Operations Center (or CSOC), focused on supporting science clouds for researchers. Compare to a Network Operations Center, or NOC. Both are an important part of the cyberinfrastructure for big data science.
Sci CSP services | Data scientist | Datascope | Science Cloud Service Provider (Sci CSP) | Cloud Service Operations Center (CSOC)
Part 3 Data Science
Establish best practices and strategies for data science in general, and discipline-specific data science in particular:
- Models and algorithms
- Data
- General and discipline-specific software applications and tools
- Data analytic infrastructure
- Foundations of data science
What are the foundations for data science?
Theory to Big Data Spectrum: mathematical theorems (no data); traditional statistical modeling (small data, GB); (semi-)automating statistical modeling (medium data, TB); simple counts and statistics over big data (big data, PB). The OSDC datascope (0.5-2.0 MW) sits at the big data end.
Part 4 The Open Science Data Cloud www.opensciencedatacloud.org
Discoveries: Team (you and your colleagues) + correlation + algorithms + Instrument (3000-core / 5 PB OSDC science cloud) + Data (1 PB of OSDC data across several disciplines)
2013 Open Science Data Cloud (IaaS)
- Compliance & security (OpenFISMA)
- Infrastructure automation & management (Yates, v0.1)
- Accounting & billing (Salesforce.com)
- Science cloud SW & services: 5 PB in 2013 (OpenStack & GlusterFS)
- Customer-facing portal & middleware (Tukey, v0.3)
- ~10-100 Gbps bandwidth
- 5 engineers to operate a 0.5 MW science cloud data center network
- Virtual machines (VMs) containing common applications & pipelines
Tukey: based in part on OpenStack Horizon. We have factored out the digital ID service, file sharing, and transport from Bionimbus and Matsu.
Yates: automated installation of the OSDC software stack on a rack of computers. Based upon Chef. Version 0.1.
UDR. UDT is a high performance network transport protocol; UDR = rsync + UDT. It makes it easy for an average systems administrator to keep 100s of TB of distributed data synchronized. We are using it to distribute c. 1 PB from the OSDC.
Open Science Data Cloud Services: digital ID services; data sharing services; data transport services (UDR). What other core services are essential? Of course, working groups and applications always add their own services. These core services will hopefully make the OSDC attractive as a platform (PaaS) for scientific discovery.
The Open Cloud Consortium is a U.S.-based not-for-profit corporation. It manages cloud computing infrastructure to support scientific research (the Open Science Data Cloud), cloud computing infrastructure to support medical and health care research (the Biomedical Commons Cloud), and cloud computing testbeds (the Open Cloud Testbed). www.opencloudconsortium.org
OCC Members & Partners. Companies: Cisco, Yahoo!, Intel. Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago. Federal agencies and labs: NASA. International partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam. Partners: National Lambda Rail.
Tukey + Yates + third-party open source software + open source software developed by the OCC and open standards + data center + data with permissions + authorization of users' access to data + policies, procedures, controls, etc. + governance and legal agreements + sustainability model
Part 5 OSDC Data
Discoveries: Team (you and your colleagues) + correlation + algorithms + Instrument (3000-core / 5 PB OSDC science cloud) + Data (1 PB of OSDC data across several disciplines)
OSDC Public Data Sets: over 800 TB of open access data in the OSDC. Earth sciences data, biological sciences data, social sciences data, digital humanities.
Part 6 OSDC Working Groups Just look around you
Matsu Working Group: Clouds to Support Earth Science. matsu.opensciencedatacloud.org
Matsu Architecture
- Presentation Services: Matsu Web Map Tile Service (WMTS), serving images at different zoom layers suitable for an OGC Web Mapping Server; Web Coverage Processing Service (WCPS)
- Analytic Services: NoSQL-based, MR-based, and streaming analytic services
- Workflow Services: Matsu MR-based Tiling Service; MapReduce is used to process Level n to Level n+1 data and to partition images for the different zoom levels
- Storage: a NoSQL database for WMS tiles and derived data products; Hadoop HDFS for Level 0, Level 1 and Level 2 images
Hadoop-Based Re-Analysis. Zoom level 1: 4 images; zoom level 2: 16 images; zoom level 3: 64 images; zoom level 4: 256 images.
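The counts above follow the usual quadtree tiling arithmetic: each zoom level splits every tile into four, so level z holds 4^z tiles. A quick sketch of that arithmetic (illustrative only, not Matsu code):

```python
# Quadtree tiling arithmetic: level z holds 4**z tiles, matching the
# slide's 4, 16, 64, 256 images at zoom levels 1-4.

def tiles_at_level(z: int) -> int:
    """Number of tiles at zoom level z in a quadtree tiling."""
    return 4 ** z

assert [tiles_at_level(z) for z in range(1, 5)] == [4, 16, 64, 256]

def total_tiles(z: int) -> int:
    """Tiles across levels 0..z: geometric series (4**(z+1) - 1) / 3."""
    return (4 ** (z + 1) - 1) // 3

assert total_tiles(4) == 1 + 4 + 16 + 64 + 256  # 341 tiles through level 4
```

The 4x growth per level is what makes MapReduce a natural fit for the tiling step: each level's partitioning is independent and embarrassingly parallel.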
Bionimbus Working Group bionimbus.opensciencedatacloud.org (biological data)
Bionimbus Protected Data Cloud
Analyzing Data From The Cancer Genome Atlas (TCGA)
Current practice:
1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate a secure, compliant computing environment to manage 10-100+ TB of data.
3. Get the environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CG-Hub (takes days to weeks).
6. Begin analysis.
With the Protected Data Cloud (PDC):
1. Apply to dbGaP for access to data.
2. Use your eRA Commons credentials to log in to the PDC, select the data that you want to analyze and the pipelines that you want to use.
3. Begin analysis.
One Million Genomes. Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation. The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue). One million genomes is about 1000 PB, or 1 EB. With compression, it may be about 100 PB. At $1000/genome, the sequencing would cost about $1B.
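The back-of-the-envelope arithmetic above can be checked directly (the 10x compression ratio is an assumption implied by the slide's 1 EB vs. ~100 PB figures):

```python
# Arithmetic from the slide: 1 TB per patient, one million genomes.
GENOMES = 1_000_000
TB_PER_GENOME = 1          # tumor + normal tissue, uncompressed
COST_PER_GENOME = 1_000    # dollars, at $1000/genome

raw_pb = GENOMES * TB_PER_GENOME / 1_000   # TB -> PB
assert raw_pb == 1_000                     # 1000 PB = 1 EB

compressed_pb = raw_pb / 10                # assumed ~10x compression
assert compressed_pb == 100                # ~100 PB

total_cost = GENOMES * COST_PER_GENOME
assert total_cost == 1_000_000_000         # ~$1B of sequencing
```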
Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis; improved understanding of genomic science; genomic-driven drug development; precision diagnosis and treatment; preventive health care.
Biomedical Commons Cloud (BCC) Working Group. Example: the Open Cloud Consortium's Biomedical Commons Cloud (BCC), connecting Medical Research Centers A, B, and C and Hospital D to a cloud for public data, a cloud for controlled genomic data, and a cloud for EMR/PHI data.
| Resource | Who uses | Who operates |
|---|---|---|
| Open Science Data Cloud (OSDC) | Pan-science data for researchers | Open Cloud Consortium (OCC), supported by university OCC members |
| Biomedical Commons Cloud (BCC) | (International) biomedical researchers | OCC Biomedical Commons Cloud Working Group, supported by OCC university members |
| Bionimbus Protected Data Cloud | Genomics researchers | University of Chicago, supported by the OCC |
OpenFlow-Enabled Hadoop WG. When running Hadoop, some map and reduce jobs take significantly longer than others. These are stragglers, and they can significantly slow down a MapReduce computation. Stragglers are common (a dirty secret about Hadoop). Infoblox and UChicago are leading an OCC working group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers. We have a testbed for a wide area version of this project.
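Before a scheduler can give stragglers extra bandwidth, it has to identify them. A common heuristic is to flag tasks running well past the median runtime; the sketch below uses a hypothetical 1.5x cutoff and is purely illustrative, not the working group's code:

```python
# Illustrative straggler detection: flag tasks whose runtime exceeds
# `factor` times the median runtime of the job's tasks. The 1.5x
# threshold is an assumption for illustration.
from statistics import median

def stragglers(runtimes: dict, factor: float = 1.5) -> list:
    """Return the task IDs whose runtime exceeds factor * median."""
    cutoff = factor * median(runtimes.values())
    return [task for task, t in runtimes.items() if t > cutoff]

tasks = {"map-00": 42.0, "map-01": 44.5, "map-02": 41.8, "map-03": 130.2}
assert stragglers(tasks) == ["map-03"]   # only the slow task is flagged
```

Hadoop's own answer to stragglers is speculative execution (re-running the slow task elsewhere); the OpenFlow approach described above instead keeps the task in place and reprovisions the network around it.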
OSDC PIRE Project. We select OSDC PIRE Fellows (US citizens or permanent residents): we give them tutorials and training on big data science; we provide them fellowships to work with OSDC international partners; we give them preferred access to the OSDC. Nominate your favorite scientist as an OSDC PIRE Fellow. www.opensciencedatacloud.org (look for PIRE)
Part 7 Key Questions for This Workshop
Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?
Question 2. What data can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?
Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?
Question 4. What tools and applications can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?
Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?
Question 6. What are 3-5 grand challenge questions that leverage the OSDC?
Questions?
Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics) called The Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.