The National Center for Genome Analysis Support as a Model Virtual Resource for Biologists

The National Center for Genome Analysis Support as a Model Virtual Resource for Biologists. Internet2 Network Infrastructure for the Life Sciences Focused Technical Workshop, Berkeley, CA, July 17-18, 2013. William K. Barnett, Ph.D., National Center for Genome Analysis Support.

Summary: The NGS Big Data Problem; NCGAS as a National Model; Current State and Prospects.

The Next Generation Sequencing Big Data Problem

Changing genomics data environment. Sequencing is becoming commoditized at large centers, multiplying at individual labs, and generating (MUCH) more data. Analytical capability is lacking: bioinformatics support, computational support, and storage support.

Sequencing is getting cheaper. Source: http://www.genome.gov/images/content/cost_per_genome.jpg

NGS data growth is outstripping storage growth. Source: Stein, Genome Biology 2010, 11:207, doi:10.1186/gb-2010-11-5-207

omicsmaps.com listed 931 Next Generation Sequencers in the US on April 29, 2013.

Genomics Data are Big. Source: Nancy Spinner keynote at the 2012 Internet2 meeting: http://events.internet2.edu/2012/fall-mm/agenda.cfm?go=session&id=10002624&event=1149

But other disciplines already handle bigger data. One Degree Imager: 1.4 Petabytes per year. Large Hadron Collider: 15 Petabytes per year; the Open Science Grid moves 2 Petabytes per day. The Square Kilometer Array: 2,000 Petabytes per year in 2013.

The National Center for Genome Analysis Support as a National Model

1. Bioinformatics consulting for biologists. 2. Large-memory clusters for assembly. 3. Optimized software for better efficiency. Initially funded by the National Science Foundation. Collaboration across multiple institutions. Open for business at: http://ncgas.org

Research networks are now (much) faster than hard drives. [Bar chart of throughput in Gbps: Internet2, 100; Data Capacitor, 40; NLR, 10; Ultra SCSI disk, 1.2; Commodity Internet, 1.]
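To put the chart in perspective, here is a back-of-the-envelope comparison (a minimal Python sketch; the link rates are the theoretical peaks from the chart above, ignoring protocol overhead) of how long a 1 TB sequencing dataset takes to move over each link:

# Rough transfer-time comparison for a 1 TB dataset,
# using the per-link rates from the chart above (Gbps, theoretical peak).
rates_gbps = {
    "Internet2": 100,
    "Data Capacitor": 40,
    "NLR": 10,
    "Ultra SCSI disk": 1.2,
    "Commodity Internet": 1,
}
dataset_bits = 1e12 * 8  # 1 TB expressed in bits
for link, gbps in rates_gbps.items():
    minutes = dataset_bits / (gbps * 1e9) / 60
    print(f"{link:>18}: {minutes:6.1f} minutes")

At these rates the 100 Gbps Internet2 path moves the dataset in under two minutes, while a 1 Gbps commodity connection takes over two hours, which is the gap the slide is pointing at.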

NCGAS Service Model. Aligned with HIPAA for clinical research at IU. Where the public cloud offers Software as a Service, Platform as a Service, and Infrastructure as a Service, NCGAS covers the full stack of needs (bioinformatics applications, services layer, OS layer, hardware layer, network layer) with expert consulting, Galaxy and tuned apps, parallel optimized environments, professional admins, supercomputer centers, and 100 Gbps Internet2.

NCGAS Virtual Instrument. [Architecture diagram: the NCGAS Galaxy Portal and POD Galaxy Portal, IU's Mason and POD clusters, 6 PB of IU storage, and the 5 PB Data Capacitor, linked over 100 Gig Internet2 and 10 Gig NLR to TACC, SDSC, and PSC (5.5 PB and 4 PB storage), sequencing centers, and NCBI; resources are labeled as NSF-funded/XSEDE allocation or federally funded.]

How does this work at scale? 1. Researchers use Galaxy to transfer files and run jobs: files are transferred to the Data Capacitor over Internet2; the Data Capacitor is mounted on several clusters and mounts reference data from NCBI. 2. Workflows execute on the best system for that analysis. 3. Results are delivered through web interfaces and to visualization or other science tools.
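As an illustration of step 1, the same upload-and-run pattern can be scripted against a Galaxy server through its REST API using the BioBlend client (a minimal sketch, assuming a reachable Galaxy instance; the URL, API key, workflow ID, and file name are hypothetical placeholders, not NCGAS values):

from bioblend.galaxy import GalaxyInstance

# Hypothetical Galaxy endpoint and credentials -- substitute real values.
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

# Create a history and upload a reads file into it.
history = gi.histories.create_history(name="assembly-run")
upload = gi.tools.upload_file("reads.fastq", history["id"])
dataset_id = upload["outputs"][0]["id"]

# Invoke an installed workflow on the uploaded dataset.
workflow_id = "WORKFLOW_ID"  # placeholder for an installed workflow
inputs = {"0": {"id": dataset_id, "src": "hda"}}
invocation = gi.workflows.invoke_workflow(
    workflow_id, inputs=inputs, history_id=history["id"])
print("Invocation state:", invocation["state"])

In the NCGAS model the researcher normally drives all of this from the Galaxy web interface; the script only shows that the same transfer-and-execute steps sit behind it.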

Rates. For NSF-funded researchers at IU or with XSEDE allocations: consulting $0 (currently 2.5 FTE bioinformaticians), data transfer $0, data storage $0 (contemplating a 25 TB allocation per project), computation $0. For other federally funded researchers running on the POD: consulting $60/hour for short consults, data transfer $20 per transfer, data storage $0.10/GB/month, computation $0.09/core hour (128 GB/node cluster).
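For projects outside the free tier, the POD rates above make cost estimates straightforward (a hypothetical worked example in Python; the project sizes are illustrative, not NCGAS figures):

# Illustrative cost estimate at the POD rates listed above.
storage_gb = 5_000        # 5 TB of project data (illustrative)
storage_months = 6
core_hours = 20_000
transfers = 4
consulting_hours = 10

cost = (storage_gb * 0.10 * storage_months   # $0.10/GB/month
        + core_hours * 0.09                  # $0.09/core hour
        + transfers * 20                     # $20 per transfer
        + consulting_hours * 60)             # $60/hour short consults
print(f"Estimated project cost: ${cost:,.2f}")  # -> $5,480.00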

Current State and Prospects

What have we done so far? 1. Partnering on 37 research projects, managing 73 TB. 2. IU infrastructure aligned with HIPAA. 3. XSEDE Tier 2 Service Provider. 4. Optimized Trinity with the Broad Institute. 5. Lustre WAN node at U. Hawaii at Hilo. Next: 1. Lustre WAN node at NCBI (July 2013). 2. More science. 3. More Lustre WAN nodes at sequencing sites.

NCGAS Partners Funding From:

Questions? Bill Barnett (barnettw@iu.edu). The NCGAS Team at IU: Rich LeDuc (rleduc@iu.edu), Le-Shin Wu (lewu@iu.edu), Thomas Doak (tdoak@indiana.edu), Carrie Ganote (cganote@iu.edu).

Acknowledgements and Disclaimers. This material is based upon work supported by the National Science Foundation under Grant No. ABI-1062432; Craig Stewart, PI; William Barnett, Matthew Hahn, and Michael Lynch, co-PIs. This work was supported in part by the Lilly Endowment, Inc. and the Indiana University Pervasive Technology Institute. Any opinions presented here are those of the presenter(s) and do not necessarily represent the opinions of the National Science Foundation or any other funding agencies.

License Terms. Please cite as: Barnett, William K., The National Center for Genome Analysis Support as a Model Virtual Resource for Biologists, presented at the Internet2 Network Infrastructure for the Life Sciences Focused Technical Workshop, Berkeley, CA. Items indicated as copyrighted are used here with permission; such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse. Except where otherwise noted, contents of this presentation are copyright 2011 by the Trustees of Indiana University. This document is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: you are free to share (copy, distribute, and transmit the work) and to remix (adapt the work) under the following conditions: attribution (you must attribute the work in the manner specified by the author or licensor, but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

JIM WILLIAMS DIRECTOR, INTERNATIONAL NETWORKING WILLIAM@INDIANA.EDU

US DOMESTIC NETWORKING. Very strong high-performance backbone (100G national footprint) from both ESnet and Internet2. Performance to your institution and your lab may vary due to local considerations. Internet2 and ESnet have network-related scientist support functions and services.

US AND INTERNATIONAL NETWORKING. The NSF-funded IRNC program provides 10G connectivity to Europe, Asia, and South/Latin America. Within those regions, national networks provide varying degrees of connectivity. IRNC investigators are eager to assist researchers with their connectivity issues.

FUTURES. Domestic networking in many countries is 100G, and international networking is moving in that direction. ANA-100G supplies 100G transatlantic connectivity between New York and Amsterdam. ESnet is interested in extending the 100G ESnet network to Europe (EEX).

SUPPORT IS ALWAYS THE ISSUE To contact IRNC investigators: http://www.irnclinks.net/ To contact Internet2: http://www.internet2.edu/science/ http://www.internet2.edu/health/ To contact ESnet: http://es.net/