Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP)

1 Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP). Robert Grossman, Institute for Genomics & Systems Biology, Center for Research Informatics, Computation Institute, Department of Medicine, University of Chicago & Open Data Group. September 11, 2012

2 The OSDC & Bionimbus Teams. Open Science Data Cloud (OSDC) Team: Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez. Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation. Bionimbus Team: Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White. Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses OSDC infrastructure in part.

3 Let's Step Back 20 Years: Petabyte Access & Storage Solutions (PASS) Project for SSC. It developed & benchmarked federated relational, OO DB, object store, & column-oriented data warehouse solutions at the TB scale.

4 A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0,

5 Part 1. Genomics as a Big Data Science

6 Source: Lincoln Stein

7 One Million Genomes Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation. The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue). One million genomes is about 1000 PB or 1 EB. With compression, it may be about 100 PB. At $1000/genome, the sequencing would cost about $1B.
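
The arithmetic on this slide can be checked with a short sketch. It assumes decimal units (1 PB = 1000 TB, 1 EB = 1000 PB) and a 10:1 compression ratio, which is only what the slide's "about 100 PB" figure implies rather than something it states. The function and field names are illustrative.

```python
# Back-of-the-envelope check of the one-million-genomes estimate.
# Assumptions: decimal units; the 10:1 compression ratio is inferred
# from the slide's "about 100 PB" figure, not stated in it.

TB_PER_PB = 1000
PB_PER_EB = 1000


def cohort_estimate(genomes, tb_per_patient=1.0, cost_per_genome=1000,
                    compression_ratio=10):
    """Return storage and sequencing-cost estimates for a cohort."""
    raw_pb = genomes * tb_per_patient / TB_PER_PB
    return {
        "raw_pb": raw_pb,
        "raw_eb": raw_pb / PB_PER_EB,
        "compressed_pb": raw_pb / compression_ratio,
        "sequencing_cost_usd": genomes * cost_per_genome,
    }


estimate = cohort_estimate(1_000_000)
print(estimate)
```

Changing `tb_per_patient` or `compression_ratio` shows how sensitive the total is to those two guesses; the cost line scales linearly with the per-genome price.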

8 Big data driven discovery on 1,000,000 genomes and 1 EB of data. Genomic-driven diagnosis. Improved understanding of genomic science. Genomic-driven drug development. Precision diagnosis and treatment. Preventive health care.

9 ER+ TNBC With genomics, we can stratify diseases and treat each stratum differently. Source: White Lab, University of Chicago.

10 Clonal Evolution of Tumors Tumors evolve temporally and spatially. Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 241, pages , 2012.

11 Combinations of Rare Alleles [Figure: penetrance (high, intermediate, modest, low) plotted against allele frequency (very rare, rare, uncommon, common). Labeled regions: very rare alleles causing Mendelian disease; low-frequency variants with intermediate penetrance; most common variants implicated in common disease by GWA; rare variants of small effect, very hard to identify by genetic means; rare examples of high-penetrance common variants influencing common disease.] Source: Mark McCarthy

12 TCGA Analysis of Lung Cancer Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi: /nature cases of SQCC (lung cancer) Matched tumor & normal Mean of 360 exonic mutations, 323 CNV, & 165 rearrangements per tumor

13 Some Examples of Big Data Science

Discipline          Duration     Size              # Devices
HEP - LHC           10 years     15 PB/year*       One
Astronomy - LSST    10 years     12 PB/year**      One
Genomics - NGS      2-4 years    0.5 TB/genome     1000s

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source:

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source:

14 One large instrument Many smaller instruments

15 Part 2. What Instrument Do we Use to Make Big Data Discoveries? How do we build a datascope?

16 TB? PB? EB? ZB? What is big data?

17 Another way: opencompute.org Think of data as big if you measure it in MW; for example, Facebook's Prineville Data Center is 30 MW.

18 An algorithm and computing infrastructure is big-data scalable if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
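
A toy model may make this definition concrete. It assumes each rack holds a fixed amount of data and scans it at a fixed rate, that racks work fully in parallel, and that there is no coordination overhead; the capacity and scan-rate numbers are made up for illustration.

```python
# Toy model of the "big-data scalable" definition: every rack brings its
# own data and its own processors, so the aggregate scan rate grows with
# the data. The default figures are illustrative assumptions only.

def scan_time_hours(racks, tb_per_rack=950.0, tb_per_hour_per_rack=50.0):
    """Hours to scan all data when each rack scans its local data in parallel."""
    total_tb = racks * tb_per_rack
    aggregate_rate = racks * tb_per_hour_per_rack
    return total_tb / aggregate_rate


times = [scan_time_hours(r) for r in (1, 2, 10, 100)]
print(times)
```

In this idealized model the scan time does not depend on the number of racks, which is exactly the property the definition asks for. A real system departs from it wherever work has to cross rack boundaries.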

19 Commercial Cloud Service Provider (CSP). Scale: 15 MW data center; 100,000 servers; 1 PB DRAM; 100s of PB of disk; ~1 Tbps egress bandwidth; 25 operators for a 15 MW commercial cloud. Components: monitoring, network security and forensics; automatic provisioning and infrastructure management; accounting and billing; customer-facing portal; data center network.

20 What are some of the important differences between commercial and research-focused CSPs?

21 Science CSP (Science Clouds) vs. Commercial CSP

POV. Science CSP: Democratize access to data. Integrate data to make discoveries. Long term archive. Commercial CSP: As long as you pay the bill; as long as the business model holds.
Data & Storage. Science CSP: Data intensive computing & HP storage. Commercial CSP: Internet style scale out and object-based storage.
Flows. Science CSP: Large data flows in and out. Commercial CSP: Lots of small web flows.
Streams. Science CSP: Streaming processing required. Commercial CSP: NA.
Accounting. Science CSP: Essential. Commercial CSP: Essential.
Lock in. Science CSP: Moving environment between CSPs essential. Commercial CSP: Lock in is good.

22 Part 3. The Open Cloud Consortium's Open Science Data Cloud

23 U.S.-based not-for-profit corporation. Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud. Manages cloud computing testbeds: Open Cloud Testbed.

24 Cloud Services Operations Centers (CSOC) The OSDC operates a Cloud Services Operations Center (CSOC) focused on supporting Science Clouds for researchers. Compare this to a Network Operations Center (NOC). Both are important parts of the cyberinfrastructure for big data science.

25 Different Styles of OSDC Racks 2012 OSDC rack design (draft): 950 TB / rack, 600 cores / rack. Design 1: Put cores over spindles. Higher cost but easy to compute over all the data. Design 2: Separate (some of the) storage from the compute.
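
As a rough sizing exercise, the draft rack figures can be turned into a rack count for a given capacity target. The sketch below assumes decimal units and raw capacity only (no replication, parity, or file-system overhead), so it gives a lower bound rather than a deployment plan. The 100 PB target is the scaling question the deck raises for the OSDC.

```python
import math

# Figures from the 2012 draft rack design on the slide.
TB_PER_RACK = 950
CORES_PER_RACK = 600


def racks_for(target_pb):
    """Racks needed to hold target_pb petabytes of raw data (1 PB = 1000 TB)."""
    return math.ceil(target_pb * 1000 / TB_PER_RACK)


racks = racks_for(100)
cores = racks * CORES_PER_RACK
print(racks, cores)
```

Under Design 1 (cores over spindles) the core count grows with the storage automatically. Under Design 2 the two can be sized separately, so `cores` here would no longer follow from `racks`.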

26 Open Science Data Cloud. Scale: 3 PB PB 2012 able to scale to 100 PB?; 5-12 operators to operate a 1-5 MW Science Cloud; ~100 Gbps bandwidth. Components: monitoring, compliance, & security; automatic provisioning and infrastructure management; accounting and billing; (OSDC) Science Cloud SW & Services; data center network; customer-facing portal (Tukey). OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT,

27 OSDC Philosophy We try to automate as much as possible (we automate the setup & operations of a rack). We try to write as little software as possible. Each project is a bit different, but in general: We assign (permanent) IDs to data managed by the OSDC and manage associated metadata. We assign and enforce permissions for users & groups of users and for files/objects, collections of files/objects, and collections of collections. We support RESTful interfaces. We do accounting for storage and core-hours.
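
The ID and permission scheme described above can be sketched as a small data model. This is not the OSDC implementation: the class, the use of UUIDs as permanent IDs, and the rule that a grant on a collection applies to everything nested inside it are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A collection of objects and/or nested collections (hypothetical model)."""
    name: str
    readers: set = field(default_factory=set)      # users or groups allowed to read
    children: list = field(default_factory=list)   # nested Collection instances
    object_ids: list = field(default_factory=list)

    def add_object(self):
        # Permanent ID: generated once when the object is registered.
        oid = str(uuid.uuid4())
        self.object_ids.append(oid)
        return oid


def can_read(principals, collection, oid, inherited=False):
    """True if any principal may read oid.

    Assumes a grant on a collection flows down to nested collections.
    """
    allowed = inherited or bool(principals & collection.readers)
    if oid in collection.object_ids:
        return allowed
    return any(can_read(principals, child, oid, allowed)
               for child in collection.children)


project = Collection("project", readers={"group:lab"})
runs = Collection("runs")
project.children.append(runs)
oid = runs.add_object()

print(can_read({"user:alice", "group:lab"}, project, oid))
print(can_read({"user:bob"}, project, oid))
```

Treating users and groups as the same kind of principal keeps the check to a single set intersection. A production system would also need write and admin rights, revocation, and an audit trail.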

28 Some of Our Biggest Mistakes Not charging for services. This resulted in a lot of bad behavior. Trying to support donated equipment without adequate staff. Being too optimistic about when big data software would be ready for prime time. Some problems with big data software don't show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.

29 Essential Services for a Science CSP: support for data intensive computing; support for big data flows; account management, authentication and authorization services; health and status monitoring; billing and accounting; ability to rapidly provision infrastructure; security services, logging, event reporting; access to large amounts of public data; high performance storage; simple data export and import services.

30 [Figure: number of projects plotted against data size. 1000s of individual scientists & small projects with small data: public infrastructure. 100s of community based science projects (via Science as a Service) with medium to large data: shared community infrastructure. 10s of very large projects with very large data: dedicated infrastructure.]

31 Part 4. Bionimbus Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.

32 Step 1. Prepare a Sample

33 Step 2. Log in to Bionimbus and get a Bionimbus Key.

34 Step 3. Send your sample to the sequencing center.

35 Step 4. Log in to Bionimbus and view your data.

36 Step 5. Use Bionimbus to perform standard and custom pipelines. Bionimbus can launch multiple virtual machines.

37 Bionimbus Virtual Machine Releases. Peak Calling: MAT, MA2C, PeakSeq, MACS, SPP. Quality Control: various. Alignment & Genotyping: Bowtie, TopHat, Samtools, Picard.

38 Software Tools: Moving Genomes

39 Bionimbus Community Genomic Cloud. Components: researcher; Cloud for Public Data (1K genomes, PubMed, etc.); personal dropbox + compute.

40 Bionimbus Private Genomic Cloud. Components: researcher; Cloud for Public Data (1K genomes, PubMed, etc.); personal dropbox & compute; Cloud for Controlled Data (TCGA, dbGaP).

41 Bionimbus Private Biomedical Cloud. Components: researcher; Cloud for Public Data (1K genomes, PubMed, etc.); personal dropbox plus compute; Cloud for Controlled Data (TCGA, dbGaP); scatter, gather queries; Clinical Research Data Warehouse; Cloud for PHI data.

42 Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc. Step 2. Send sample to be sequenced. Step 3a. Internal sequencers: return raw reads. Step 3b. External sequencing partner: return variant calls, CNV, annotation. Step 4. Secure data routing to appropriate cloud based upon BID. Step 5. Cloud based analysis using IGSB and 3rd party tools and applications. Components shown: BID Generator; Bionimbus Community Cloud; Bionimbus Private Cloud UC; Bionimbus Private Cloud XY; dbGaP; Amazon.
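
Step 1 and Step 4 of this workflow amount to recording a destination when the BID is issued and looking it up when data comes back. A minimal sketch of that lookup follows. The record fields, the data-class labels, and the routing table are hypothetical; only the cloud names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BidRecord:
    """What is registered when a Bionimbus ID is issued (hypothetical fields)."""
    bid: str
    project: str
    data_class: str  # assumed labels: "community" or "private"


# Assumed mapping from data class to destination cloud.
ROUTES = {
    "community": "Bionimbus Community Cloud",
    "private": "Bionimbus Private Cloud",
}


def route(record):
    """Return the destination cloud for data tagged with this BID."""
    try:
        return ROUTES[record.data_class]
    except KeyError:
        # Refuse to route rather than fall back to a less restricted cloud.
        raise ValueError(f"no route for data class {record.data_class!r}") from None


sample = BidRecord(bid="BID-0001", project="demo", data_class="private")
print(route(sample))
```

Failing closed on an unknown data class is a deliberate choice in this sketch: for controlled or private data, a routing error is safer than a silent default.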

43 web2py-based Front End (Eucalyptus, OpenStack) (PostgreSQL) Utility Cloud Services (IDs, etc.) Database Services Data Ingestion Services Analysis Pipelines & Re-analysis Services Data Cloud Services Intercloud Services (UDT, replication) (Hadoop, Sector/Sphere)

44 >300 ChIP datasets: chromatin/RNA timecourse, CBP, PolII, Pho/silencers, HDACs, insulators, TFs. Predictions: 537 silencers; 2,307 new promoters; 12,285 enhancers; 14,145 insulators. Source: Negre et al. Nature

45 Part 5. Managing One Million Genomes

46 Relational databases Summary level ( TB) Enrich with clinical data NoSql & scientific databases NoSql, DFS, file overlays? Variation (VCF) Files (1-10 PB) (Genomic variation) Sequence (BAM) Files ( PB) (Sequence data in binary form)

47 Acknowledgements Major funding and support for the Open Science Data Cloud is provided by the Gordon and Betty Moore Foundation, which has provided $2M of funding to the OSDC to launch Phase 1 of the project ( ). Moore Foundation funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors: The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks. NSF awarded the OSDC a 5-year ( ) $3.5M PIRE award to train scientists to use the OSDC and to further develop the underlying technology. OSDC technology for high performance data transport is supported in part by NSF Award The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections. The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.

48 For more information You can find some more information on my blog: rgrossman.com. Some of my technical papers are also available there. My address is robert.grossman at uchicago dot edu

49 Sources for images The image of the hard disk is from Norlando Pobre, Creative Commons. The image of the Facebook Prineville Data Center is from the Intel Free Press, Creative Commons BY 2.0. The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0,


More information

Forget about the Clouds, Shoot for the MOON

Forget about the Clouds, Shoot for the MOON Forget about the Clouds, Shoot for the MOON Wu FENG feng@cs.vt.edu Dept. of Computer Science Dept. of Electrical & Computer Engineering Virginia Bioinformatics Institute September 2012, W. Feng Motivation

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

The OpenCirrus TM Project: A global Testbed for Cloud Computing R&D

The OpenCirrus TM Project: A global Testbed for Cloud Computing R&D The OpenCirrus TM Project: A global Testbed for Cloud Computing R&D Marcel Kunze Steinbuch Centre for Computing (SCC) Karlsruhe Institute of Technology (KIT) Germany KIT The cooperation of Forschungszentrum

More information

Storage on the Lunatic Fringe. Thomas M. Ruwart University of Minnesota Digital Technology Center Intelligent Storage Consortium

Storage on the Lunatic Fringe. Thomas M. Ruwart University of Minnesota Digital Technology Center Intelligent Storage Consortium Storage on the Lunatic Fringe Thomas M. Ruwart University of Minnesota Digital Technology Center Intelligent Storage Consortium tmruwart@dtc.umn.edu Orientation Who are the lunatics? What are their requirements?

More information

Data Intensive Scalable Computing. Thanks to: Randal E. Bryant Carnegie Mellon University

Data Intensive Scalable Computing. Thanks to: Randal E. Bryant Carnegie Mellon University Data Intensive Scalable Computing Thanks to: Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Big Data Sources: Seismic Simulations Wave propagation during an earthquake Large-scale

More information

Georgia State University Cyberinfrastructure Plan

Georgia State University Cyberinfrastructure Plan Georgia State University Cyberinfrastructure Plan Summary Building relationships with a wide ecosystem of partners, technology, and researchers are important for GSU to expand its innovative improvements

More information

CANARIE: Providing Essential Digital Infrastructure for Canada

CANARIE: Providing Essential Digital Infrastructure for Canada CANARIE: Providing Essential Digital Infrastructure for Canada Mark Wolff; CTO April 16, 2014 A Transformation of the Science Paradigm thousands of years ago last few hundred years last few decades today

More information

Galaxy workshop at the Winter School Igor Makunin

Galaxy workshop at the Winter School Igor Makunin Galaxy workshop at the Winter School 2016 Igor Makunin i.makunin@uq.edu.au Winter school, UQ, July 6, 2016 Plan Overview of the Genomics Virtual Lab Introduce Galaxy, a web based platform for analysis

More information

2014 年 3 月 13 日星期四. From Big Data to Big Value Infrastructure Needs and Huawei Best Practice

2014 年 3 月 13 日星期四. From Big Data to Big Value Infrastructure Needs and Huawei Best Practice 2014 年 3 月 13 日星期四 From Big Data to Big Value Infrastructure Needs and Huawei Best Practice Data-driven insight Making better, more informed decisions, faster Raw Data Capture Store Process Insight 1 Data

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 25: Parallel Databases CSE 344 - Winter 2013 1 Announcements Webquiz due tonight last WQ! J HW7 due on Wednesday HW8 will be posted soon Will take more hours

More information

CLOUDS OF JINR, UNIVERSITY OF SOFIA AND INRNE JOIN TOGETHER

CLOUDS OF JINR, UNIVERSITY OF SOFIA AND INRNE JOIN TOGETHER CLOUDS OF JINR, UNIVERSITY OF SOFIA AND INRNE JOIN TOGETHER V.V. Korenkov 1, N.A. Kutovskiy 1, N.A. Balashov 1, V.T. Dimitrov 2,a, R.D. Hristova 2, K.T. Kouzmov 2, S.T. Hristov 3 1 Laboratory of Information

More information

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Unleash Your Data Center s Hidden Power September 16, 2014 Molly Rector CMO, EVP Product Management & WW Marketing

More information

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography Christopher Crosby, San Diego Supercomputer Center J Ramon Arrowsmith, Arizona State University Chaitan

More information

Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391

Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391 Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391 Outline Big Data Big Data Examples Challenges with traditional storage NoSQL Hadoop HDFS MapReduce Architecture 2 Big Data In information

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

The CMS Computing Model

The CMS Computing Model The CMS Computing Model Dorian Kcira California Institute of Technology SuperComputing 2009 November 14-20 2009, Portland, OR CERN s Large Hadron Collider 5000+ Physicists/Engineers 300+ Institutes 70+

More information

A Review Paper on Big data & Hadoop

A Review Paper on Big data & Hadoop A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College

More information

CERN s Business Computing

CERN s Business Computing CERN s Business Computing Where Accelerated the infinitely by Large Pentaho Meets the Infinitely small Jan Janke Deputy Group Leader CERN Administrative Information Systems Group CERN World s Leading Particle

More information

A Better Approach to Leveraging an OpenStack Private Cloud. David Linthicum

A Better Approach to Leveraging an OpenStack Private Cloud. David Linthicum A Better Approach to Leveraging an OpenStack Private Cloud David Linthicum A Better Approach to Leveraging an OpenStack Private Cloud 1 Executive Summary The latest bi-annual survey data of OpenStack users

More information

Building on Existing Communities: the Virtual Astronomical Observatory (and NIST)

Building on Existing Communities: the Virtual Astronomical Observatory (and NIST) Building on Existing Communities: the Virtual Astronomical Observatory (and NIST) Robert Hanisch Space Telescope Science Institute Director, Virtual Astronomical Observatory Data in astronomy 2 ~70 major

More information

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013 Data Replication: Automated move and copy of data PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013 Claudio Cacciari c.cacciari@cineca.it Outline The issue

More information

High Performance Computing Resources at MSU

High Performance Computing Resources at MSU MICHIGAN STATE UNIVERSITY High Performance Computing Resources at MSU Last Update: August 15, 2017 Institute for Cyber-Enabled Research Misson icer is MSU s central research computing facility. The unit

More information

What s New at AWS? looking at just a few new things for Enterprise. Philipp Behre, Enterprise Solutions Architect, Amazon Web Services

What s New at AWS? looking at just a few new things for Enterprise. Philipp Behre, Enterprise Solutions Architect, Amazon Web Services What s New at AWS? looking at just a few new things for Enterprise Philipp Behre, Enterprise Solutions Architect, Amazon Web Services 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

More information

Next Generation Science and Infrastructure Support

Next Generation Science and Infrastructure Support Next Generation Science and Infrastructure Support James Lowey Director Network & Computing Systems TGEN The Translational Genomics Research Institute (TGen) Non-profit Biomedical research institute Founded

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Financed by the European Commission 7 th Framework Programme. biobankcloud.eu. Jim Dowling, PhD Assoc. Prof, KTH Project Coordinator

Financed by the European Commission 7 th Framework Programme. biobankcloud.eu. Jim Dowling, PhD Assoc. Prof, KTH Project Coordinator Financed by the European Commission 7 th Framework Programme. biobankcloud.eu Jim Dowling, PhD Assoc. Prof, KTH Project Coordinator The Biobank Bottleneck We will soon be generating massive amounts of

More information

The Materials Data Facility

The Materials Data Facility The Materials Data Facility Ben Blaiszik (blaiszik@uchicago.edu), Kyle Chard (chard@uchicago.edu) Ian Foster (foster@uchicago.edu) materialsdatafacility.org What is MDF? We aim to make it simple for materials

More information

In-Memory Technology in Life Sciences

In-Memory Technology in Life Sciences in Life Sciences Dr. Matthieu-P. Schapranow In-Memory Database Applications in Healthcare 2016 Apr Intelligent Healthcare Networks in the 21 st Century? Hospital Research Center Laboratory Researcher Clinician

More information

High-Energy Physics Data-Storage Challenges

High-Energy Physics Data-Storage Challenges High-Energy Physics Data-Storage Challenges Richard P. Mount SLAC SC2003 Experimental HENP Understanding the quantum world requires: Repeated measurement billions of collisions Large (500 2000 physicist)

More information

Modelos de Negócio na Era das Clouds. André Rodrigues, Cloud Systems Engineer

Modelos de Negócio na Era das Clouds. André Rodrigues, Cloud Systems Engineer Modelos de Negócio na Era das Clouds André Rodrigues, Cloud Systems Engineer Agenda Software and Cloud Changed the World Cisco s Cloud Vision&Strategy 5 Phase Cloud Plan Before Now From idea to production:

More information

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber NERSC Site Update National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory Richard Gerber NERSC Senior Science Advisor High Performance Computing Department Head Cori

More information

Big Data on AWS. Big Data Agility and Performance Delivered in the Cloud. 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Big Data on AWS. Big Data Agility and Performance Delivered in the Cloud. 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Data on AWS Big Data Agility and Performance Delivered in the Cloud 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Data Technologies and techniques for working productively

More information

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache Databases on AWS 2017 Amazon Web Services, Inc. and its affiliates. All rights served. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services,

More information

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security Bringing OpenStack to the Enterprise An enterprise-class solution ensures you get the required performance, reliability, and security INTRODUCTION Organizations today frequently need to quickly get systems

More information

Cisco Unified Computing System

Cisco Unified Computing System Cisco Unified Computing System Architected for Workload Diversity and Fast IT Todd Brannon, Director of Product Marketing, Unified Computing tobranno@cisco.com @tobranno Agenda Applications & Architecture

More information

SoftNAS Cloud Performance Evaluation on AWS

SoftNAS Cloud Performance Evaluation on AWS SoftNAS Cloud Performance Evaluation on AWS October 25, 2016 Contents SoftNAS Cloud Overview... 3 Introduction... 3 Executive Summary... 4 Key Findings for AWS:... 5 Test Methodology... 6 Performance Summary

More information

Portal: Applications of New Technology to Transportation Data Archiving. Kristin Tufte & the Portal Team NATMEC, July 1, 2014, Chicago, IL

Portal: Applications of New Technology to Transportation Data Archiving. Kristin Tufte & the Portal Team NATMEC, July 1, 2014, Chicago, IL + Portal: Applications of New Technology to Transportation Data Archiving Kristin Tufte & the Portal Team NATMEC, July, 24, Chicago, IL + Who is Kristin? 2 years Data Management System Design and Implementation

More information