The Future of Galaxy Nate Coraor galaxyproject.org

Galaxy is... a framework for scientists: it enables the use of complicated command-line tools, deals with file formats as transparently as possible, and provides a rich visualization and visual analytics system.

Galaxy is... available in three ways. getgalaxy.org: free, open source software; bring your own compute, storage, and tools; maximize privacy and security. usegalaxy.org/cloud: a Galaxy cluster in Amazon EC2; buy as much compute and storage as you need. usegalaxy.org: the free, public Galaxy server, with 3.5 TB of reference data, 0.8 PB of user data, and 4,000+ jobs/day.

[Chart: New Users per Month, January 2010 through January 2013, on a scale of 300 to 1,500]

[Chart: usegalaxy.org data growth, annotated with the addition of 128 cores for NGS/multicore jobs and the implementation of data quotas]

[Chart: usegalaxy.org frustration growth, April 2008 through August 2013: Total Jobs Completed (count, scale 0 to 160,000) and Jobs Deleted Before Run (%, scale 0 to 10%)]

Where we are

Where we are going

Where we are going: continuing work with ECSS to submit jobs to disparate XSEDE resources; a Globus Online endpoint for usegalaxy.org; allowing users to use their XSEDE allocations directly through usegalaxy.org; and displaying detailed information about queue position and resource utilization.

Massive scale analysis: improve the Galaxy workflow engine and UI. We can run workflows on single datasets now; what about hundreds or thousands?
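One way to scale beyond single datasets is to drive workflows through the Galaxy API. A minimal sketch using the community BioBlend client, assuming a placeholder URL, API key, and history/workflow IDs, and assuming the workflow's single dataset input is step "0":

```python
# Sketch: fan a Galaxy workflow out over every dataset in a history via the
# Galaxy API, using the BioBlend client. URL, key, and IDs are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

history_id = "HISTORY_ID"    # history holding the input datasets
workflow_id = "WORKFLOW_ID"  # workflow whose only dataset input is step "0"

for item in gi.histories.show_history(history_id, contents=True):
    if item.get("deleted") or item.get("history_content_type") != "dataset":
        continue  # skip deleted items and dataset collections
    inputs = {"0": {"src": "hda", "id": item["id"]}}
    gi.workflows.invoke_workflow(workflow_id, inputs=inputs, history_id=history_id)
```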

Scaling efforts: so many tools and workflows, not enough manpower, so the focus is on building infrastructure that allows the community to integrate and share tools, workflows, and best practices. Too much data, not enough infrastructure, so we will support greater access to usegalaxy.org public and user data from local and cloud Galaxy instances.

Data exchange: a big data store for encouraging data exchange among Galaxy instances, with Galaxy data mirrored in the PSC SLASH2-backed Data Supercell.

Federation

Establishing an XSEDE Galaxy Gateway. XSEDE ECSS Symposium, December 17, 2013. Philip Blood, Senior Computational Scientist, Pittsburgh Supercomputing Center, blood@psc.edu

Galaxy team: james.taylor@taylorlab.org, anton@bx.psu.edu, nate@bx.psu.edu. PSC team: blood@psc.edu, ropelews@psc.edu, josephin@psc.edu, yanovich@psc.edu, rbudden@psc.edu, zhihui@psc.edu, sergiu@psc.edu

643 HiSeqs = 6.5 Pb/year

Using Galaxy to handle big data? Compartmentalized solutions: private Galaxy installations on campuses; Galaxy installations on XSEDE (e.g., NCGAS); Galaxy installations at other CI/cloud providers (e.g., Globus Genomics); Galaxy on public clouds (e.g., Amazon).

The vision: a United Federation of Galaxies. Ultimately, we envision that any Galaxy instance (in any lab, not just Galaxy Main) will be able to spawn jobs, access data, and share data on external infrastructure, whether this is an XSEDE resource, a cluster of Amazon EC2 machines, a remote storage array, etc.

A step forward: make Galaxy Main an XSEDE Galaxy gateway. Certain Galaxy Main workflows or tasks will be executed on XSEDE resources, especially tasks that require HPC, e.g., the de novo assembly applications Velvet (genome) and Trinity (transcriptome) on PSC Blacklight (up to 16 TB of coherent shared memory per process). This should be transparent to the user of usegalaxy.org.

Key problems to solve. Data migration: Galaxy currently relies on a shared filesystem (implemented via NFS) between the instance host and the execution server to store the reference and user data required by the workflow. Remote job submission: Galaxy job execution currently requires a direct interface with the resource manager on the execution server.

What we've done so far.* Addressing data migration issues: established a 10 GigE link between PSC and Penn State, and established a common wide-area distributed filesystem between PSC and Penn State using SLASH2 (http://quipu.psc.teragrid.org/slash2/). Addressing remote job submission: created a new Galaxy job-running plugin for SSH job submission, incorporated Velvet and Trinity into Galaxy's XML tool interface, and successfully submitted test jobs from Penn State that executed on Blacklight using the data replicated via SLASH2 from Penn State to PSC. *Some of these points will be revisited, since Galaxy is now hosted at TACC.
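A minimal sketch of the SSH job submission idea (not the actual Galaxy plugin code): run the remote scheduler's commands over ssh and capture the job ID it prints. The host name and the qsub/qstat commands are assumptions for illustration.

```python
# Sketch of SSH-based remote job submission: run the remote scheduler's
# commands over ssh. Host and scheduler commands are illustrative assumptions.
import subprocess

REMOTE = "galaxy@blacklight.example.org"  # hypothetical login host

def submit(script_path):
    """Submit a batch script remotely and return the scheduler's job ID."""
    result = subprocess.run(
        ["ssh", REMOTE, "qsub", script_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

def is_finished(job_id):
    """Poll the remote scheduler; qstat exits nonzero once the job is gone."""
    result = subprocess.run(
        ["ssh", REMOTE, "qstat", job_id],
        capture_output=True, text=True,
    )
    return result.returncode != 0
```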

Galaxy remote data architecture: access to the shared dataset via /galaxys2 is identical from Galaxy Main and PSC; the SLASH2 file system handles consistency, multiple-residency coherency, and presence; local copies are maintained for performance; jobs run on PSC compute resources such as Blacklight, as well as on Galaxy Main. [Diagram: data generation and processing nodes at PSC and at Galaxy Main each mount /galaxys2, the SLASH2 wide-area common file system (GalaxyFS)]

Galaxy Main gateway: what remains to be done (1). Integrate this work with the production public Galaxy site, usegalaxy.org (now hosted at TACC). Dynamic job submission: allow the selection of appropriate remote or local resources (cores, memory, walltime, etc.) based on individual job requirements, possibly using an Open Grid Services Architecture Basic Execution Service (OGSA-BES) compatible service such as UNICORE.
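Galaxy's dynamic job destinations are one way to express this kind of selection: a small Python rule inspects each job and returns the ID of a destination defined in the job configuration. A minimal sketch, with hypothetical destination IDs and an arbitrary input-size threshold:

```python
# Sketch of a dynamic job destination rule: send large assembly jobs to a
# remote large-memory resource, everything else to the local cluster.
# Destination IDs and the 50 GB threshold are illustrative assumptions.
def route_by_input_size(job):
    total_bytes = sum(
        assoc.dataset.get_size()
        for assoc in job.input_datasets
        if assoc.dataset is not None
    )
    if total_bytes > 50 * 1024**3:   # more than 50 GB of input data
        return "psc_blacklight"      # hypothetical large-memory destination
    return "local_cluster"           # hypothetical default destination
```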

What remains to be done (2). Galaxy-controlled data management, to intelligently and efficiently migrate and use data on distributed compute resources: testing various data migration strategies with SLASH2 and other available technologies, and further developing SLASH2 to meet federated Galaxy requirements through the recent NSF DIBBs award at PSC. Authentication with Galaxy instances using XSEDE or other credentials, e.g., InCommon/CILogon (see the upcoming talk by Indiana). Additional data transfer capabilities in Galaxy, such as iRODS and Globus Online (see the upcoming talk on Globus Genomics).

Eventually: use these technologies to enable universal federation.

Appendix: initial Galaxy data staging to PSC; underlying SLASH2 architecture.

Initial Galaxy data staging to PSC: transferred 470 TB in 21 days from PSU to PSC (average ~22 TB/day; peak 40 TB/day). rsync was used to stage the data initially and to synchronize subsequent updates. A copy of the data is maintained at PSC in the /arc file system, available from compute nodes. [Diagram: data generation nodes and storage at Penn State connected over a 10 GigE link to the Data Supercell at PSC]
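A short sketch of that stage-and-sync pattern; the paths, host, and rsync options here are placeholders, not the flags actually used:

```python
# Sketch: initial bulk copy plus incremental re-syncs with rsync.
# Source tree, destination host/path, and flags are placeholders.
import subprocess

SRC = "/galaxy/files/"                        # hypothetical source tree at PSU
DST = "staging@psc.example.org:/arc/galaxy/"  # hypothetical PSC destination

def sync():
    """One rsync pass; rerunning it transfers only files that have changed."""
    subprocess.run(["rsync", "-a", "--partial", SRC, DST], check=True)
```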

Underlying SLASH2 architecture. Metadata server (MDS): one at Galaxy Main and one at PSC for performance; converts pathnames to object IDs; schedules updates when copies become inconsistent; runs a consistency protocol to avoid incoherent data; enforces residency and network scheduling policies. Clients are compute resources and dedicated front ends; they send READ and WRITE I/O to the I/O servers (IOS) and all other file operations (RENAME, SYMLINK, etc.) to the MDS. I/O servers are very lightweight and can use most backing file systems (ZFS, ext4, etc.). Dataset residency requests are issued by administrators and/or users.

NCGAS is funded by the National Science Foundation and provides: 1. large-memory clusters for assembly; 2. bioinformatics consulting for biologists; 3. optimized software for better efficiency. A collaboration across IU, TACC, SDSC, and PSC. Open for business at: http://ncgas.org

Making it easier for biologists: a web interface to NCGAS resources that supports many bioinformatics tools and is available for both research and instruction. [Diagram: computational skills range from low (common) to high (rare)]

The galaxy.ncgas.org model: galaxy.ncgas.org is hosted in a virtual box, and individual projects can get duplicate boxes provided they support them themselves. The host for each tool is configured individually; NCGAS establishes tools, hardens them, and moves them into production. [Diagram: connections to Quarry, Mason, the Archive, and the Data Capacitor]

Moving forward. [Diagram: your friendly neighborhood sequencing center connects over 10 to 100 Gbps links, using Globus Online and other tools, to NCGAS Mason (free for NSF users), the Data Capacitor and Lustre WAN file system (no data storage charges), other NCGAS XSEDE resources, and the IU POD (12 cents per core hour), with optimized software]

[Chart: NCGAS Galaxy usage in 2013, core hours per month (scale 0 to 4,500), January through November]

CILogon Authentication for Galaxy Dec. 17, 2013

Goals and approaches. NCGAS authentication requirements: XSEDE users can authenticate to NCGAS Galaxy with InCommon credentials, and only NCGAS-authorized users can authenticate and use the resource. The CILogon service (http://www.cilogon.org) allows users to authenticate with their home organization and obtain a certificate for secure access to cyberinfrastructure; it supports the MyProxy OAuth protocol for certificate delegation, enabling science gateways to access CI on the user's behalf. The approach: incorporate CILogon as external user authentication for Galaxy, with a home-brewed simple authorization mechanism.

Technical challenges: the CILogon OAuth client implementation is Java while Galaxy is Python, and Python lacks full-featured OAuth libraries supporting the RSA-SHA1 signature method required by CILogon's OAuth interface. Once a user is authenticated through CILogon, the remote username needs to be forwarded to Galaxy via the Apache proxy. Additional authorization is required for CILogon-authenticated users. Some of the default CILogon IdPs, including the OpenID providers (Google, PayPal, VeriSign), are not desired.
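For illustration only, an OAuth 1.0a request can be signed with RSA-SHA1 in Python using the oauthlib package (the work described here used a PHP client instead); the endpoint URL, consumer name, and key path below are placeholders:

```python
# Illustration: RSA-SHA1 signing of an OAuth 1.0a request with oauthlib
# (requires its RSA extras). Consumer name, key path, and URL are placeholders,
# not the CILogon production endpoints.
from oauthlib.oauth1 import Client, SIGNATURE_RSA

with open("gateway_private_key.pem") as fh:
    rsa_key = fh.read()

client = Client(
    client_key="my-science-gateway",   # hypothetical OAuth consumer name
    signature_method=SIGNATURE_RSA,
    rsa_key=rsa_key,
)

# sign() returns the URI, headers, and body to send; the Authorization header
# carries the RSA-SHA1 signature.
uri, headers, body = client.sign("https://cilogon.example.org/oauth/initiate")
```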

Authentication architecture. [Diagram: Apache web server with HTTP_COOKIE handling and the PHP CILogon OAuth client]

Technical highlights: a PHP (non-Java) implementation of the CILogon OAuth client. Configuring the Apache proxy to Galaxy: enable Galaxy external user authentication (universe_wsgi.ini); configure Apache for proxy forwarding (httpd-ssl.conf); configure Apache for CILogon authentication with an HTTP_COOKIE rewrite (httpd-ssl.conf). A customized NCGAS skin limits IdPs to InCommon academic institutions. A PHP implementation of simple file-based user authorization. Lightweight and packaged for general Galaxy installation. Open source, with more details at: http://sourceforge.net/p/ogce/svn/head/tree/galaxy/
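The file-based authorization step is simple enough to sketch. Below is a hypothetical Python rendering of the idea (the NCGAS implementation itself is PHP), assuming a flat file with one authorized username per line:

```python
# Hypothetical Python sketch of simple file-based authorization: allow access
# only if the CILogon-asserted username appears in a flat file.
AUTHZ_FILE = "/etc/galaxy/authorized_users.txt"  # assumed location

def is_authorized(remote_user):
    """Return True if remote_user is listed, one username per line."""
    with open(AUTHZ_FILE) as fh:
        allowed = {line.strip() for line in fh if line.strip()}
    return remote_user in allowed
```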

Demo https://galaxy.ncgas.org

Experiences in building a next-generation sequencing analysis service using Galaxy, Globus, and Amazon Web Services. Ravi K. Madduri, Argonne National Laboratory and University of Chicago

Globus Genomics Architecture www.globus.org/genomics

Globus Genomics solution description: integrated identity management, group management, and data movement using Globus; computational profiles for various analysis tools; resources provisioned on demand on Amazon Web Services cloud-based infrastructure; GlusterFS as a shared file system between head nodes and compute nodes; provisioned I/O on EBS. www.globus.org/genomics
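As an illustration of the Globus data movement piece (not the Globus Genomics code itself), a transfer between two endpoints can be scripted with today's globus-sdk Python package; the endpoint IDs, paths, and access token below are placeholders:

```python
# Illustration: submit a Globus transfer between two endpoints using the
# globus-sdk package. Token, endpoint IDs, and paths are placeholders.
import globus_sdk

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_ACCESS_TOKEN")
)

tdata = globus_sdk.TransferData(
    tc, "SOURCE_ENDPOINT_ID", "DESTINATION_ENDPOINT_ID", label="fastq staging"
)
tdata.add_item("/sequencer/run42/", "/scratch/galaxy/run42/", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])
```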

Globus Genomics Usage www.globus.org/genomics

Example user: the Cox Lab (Computation Institute, University of Chicago, Chicago, IL, USA; Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA; Section of Genetic Medicine, University of Chicago, Chicago, IL). Challenges in next-gen sequencing analysis; parallel workflows on Globus Genomics; high-performance, reusable consensus. www.globus.org/genomics

Globus Genomics Pricing www.globus.org/genomics

Acknowledgments: This work was supported in part by the NIH through the NHLBI grant The Cardiovascular Research Grid (R24HL085343) and by the U.S. Department of Energy under contract DE-AC02-06CH11357. We are grateful to Amazon, Inc. for an award of Amazon Web Services time that facilitated early experiments. Thanks to the Globus Genomics and Globus Online teams at the University of Chicago and Argonne National Laboratory. www.globus.org/genomics

For more information on Globus Genomics and to sign up: www.globus.org/genomics. More information on Globus Online: www.globus.org. Questions? Thank you!