The Future of Galaxy
Nate Coraor
galaxyproject.org
Galaxy is...
A framework for scientists
Enables use of complicated command-line tools
Deals with file formats as transparently as possible
Provides a rich visualization and visual analytics system
Galaxy is...
getgalaxy.org: free, open source software. Bring your own compute, storage, and tools; maximize privacy and security.
usegalaxy.org/cloud: a Galaxy cluster in Amazon EC2. Buy as much compute and storage as you need.
usegalaxy.org: the free, public Galaxy server. 3.5 TB of reference data, 0.8 PB of user data, 4,000+ jobs/day.
New Users per Month
[chart: new users per month, roughly 300 to 1,500, Jan 2010 through Jan 2013]
usegalaxy.org data growth
[chart; annotations: +128 cores for NGS/multicore jobs; data quotas implemented]
usegalaxy.org frustration growth
[chart: Total Jobs Completed (count) vs. Jobs Deleted Before Run (%), monthly from 2008-04 through 2013-08]
Where we are
Where we are going
Where we are going
Continuing work with ECSS to submit jobs to disparate XSEDE resources
Globus Online endpoint for usegalaxy.org
Allow users to utilize their XSEDE allocations directly through usegalaxy.org
Display detailed information about queue position and resource utilization
Massive Scale Analysis
Improve the Galaxy workflow engine and UI: we can run workflows on single datasets now, but what about hundreds or thousands?
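As a rough illustration of what running a workflow over many datasets can look like today, here is a minimal sketch using the BioBlend client for the Galaxy API. The server URL, API key, workflow and history IDs, and the single-input step label are placeholder assumptions, and the exact call signature depends on the BioBlend version.

```python
# Sketch: invoking one Galaxy workflow over many datasets via BioBlend.
# URL, API key, and all IDs below are hypothetical placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.example.org", key="YOUR_API_KEY")

workflow_id = "abc123"   # an existing workflow
history_id = "def456"    # history holding the input datasets

datasets = gi.histories.show_history(history_id, contents=True)

for ds in datasets:
    if ds.get("deleted"):
        continue
    # Map each dataset to the workflow's single input step (label assumed to be "0").
    inputs = {"0": {"src": "hda", "id": ds["id"]}}
    gi.workflows.invoke_workflow(workflow_id, inputs=inputs, history_id=history_id)
```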
Scaling Efforts
So many tools and workflows, not enough manpower: focus on building infrastructure that lets the community integrate and share tools, workflows, and best practices.
Too much data, not enough infrastructure: support greater access to usegalaxy.org public and user data from local and cloud Galaxy instances.
Data Exchange
A big data store for encouraging data exchange among Galaxy instances
Galaxy data mirrored in the PSC SLASH2-backed Data Supercell
Federation
Establishing an XSEDE Galaxy Gateway
XSEDE ECSS Symposium, December 17, 2013
Philip Blood, Senior Computational Scientist, Pittsburgh Supercomputing Center
blood@psc.edu
Galaxy Team: james.taylor@taylorlab.org, anton@bx.psu.edu, nate@bx.psu.edu
PSC Team: blood@psc.edu, ropelews@psc.edu, josephin@psc.edu, yanovich@psc.edu, rbudden@psc.edu, zhihui@psc.edu, sergiu@psc.edu
643 HiSeqs = 6.5 Pb/year
Using Galaxy to Handle Big Data?
Compartmentalized solutions:
Private Galaxy installations on campuses
Galaxy installations on XSEDE (e.g., NCGAS)
Galaxy installations at other CI/cloud providers (e.g., Globus Genomics)
Galaxy on public clouds (e.g., Amazon)
The Vision: A United Federation of Galaxies
Ultimately, we envision that any Galaxy instance (in any lab, not just Galaxy Main) will be able to spawn jobs, access data, and share data on external infrastructure, whether that is an XSEDE resource, a cluster of Amazon EC2 machines, a remote storage array, etc.
A Step Forward: Make Galaxy Main an XSEDE Galaxy Gateway
Certain Galaxy Main workflows or tasks will be executed on XSEDE resources, especially tasks that require HPC, e.g., the de novo assembly applications Velvet (genome) and Trinity (transcriptome) on PSC Blacklight (up to 16 TB of coherent shared memory per process).
This should be transparent to the user of usegalaxy.org.
Key Problems to Solve
Data migration: Galaxy currently relies on a shared filesystem (NFS) between the instance host and the execution server to store the reference and user data required by a workflow.
Remote job submission: Galaxy job execution currently requires a direct interface with the resource manager on the execution server.
What We've Done So Far*
Addressing data migration issues:
Established a 10 GigE link between PSC and Penn State
Established a common wide-area distributed filesystem between PSC and Penn State using SLASH2 (http://quipu.psc.teragrid.org/slash2/)
Addressing remote job submission:
Created a new Galaxy job-running plugin for SSH job submission
Incorporated Velvet and Trinity into Galaxy's XML tool interface
Successfully submitted test jobs from Penn State and executed them on Blacklight using the data replicated via SLASH2 from Penn State to PSC
*Some of these points will be revisited, since Galaxy is now hosted at TACC
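The slide does not show the plugin itself, so the following is only a minimal Python sketch of the underlying idea, not the actual Galaxy job-runner interface: submit a batch script to the remote resource manager over SSH and read back the job ID. The hostname, username, queue, and use of paramiko are assumptions.

```python
# Illustrative sketch of SSH job submission (not the real Galaxy runner plugin).
import paramiko

def submit_remote_job(host, user, script_path, queue="batch"):
    """Submit a PBS script on a remote host over SSH and return the job ID."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user)  # key-based authentication assumed
    try:
        _, stdout, stderr = client.exec_command(f"qsub -q {queue} {script_path}")
        job_id = stdout.read().decode().strip()
        if not job_id:
            raise RuntimeError(stderr.read().decode())
        return job_id
    finally:
        client.close()

# Example (hypothetical host and script path):
# job_id = submit_remote_job("blacklight.psc.example", "galaxyuser", "/galaxys2/jobs/velvet.pbs")
```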
Galaxy Remote Data Architecture
[diagram: data generation and processing nodes at Galaxy Main and PSC share the /galaxys2 dataset over GalaxyFS, a SLASH2 wide-area common filesystem]
Access to the shared dataset via /galaxys2 is identical from Galaxy Main and PSC
The SLASH2 filesystem handles consistency, coherency, and presence of multiple resident copies
Local copies are maintained for performance
Jobs run on PSC compute resources such as Blacklight, as well as on Galaxy Main
Galaxy Main Gateway: What Remains to Be Done (1)
Integrate this work with the production public Galaxy site, usegalaxy.org (now hosted at TACC)
Dynamic job submission: select appropriate remote or local resources (cores, memory, walltime, etc.) based on individual job requirements, possibly using an Open Grid Services Architecture Basic Execution Service compatible service such as UNICORE
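To make the dynamic job submission point concrete, here is a hedged sketch of the kind of routing rule it implies: choose a destination from the job's estimated input size. The function signature, destination names, and size threshold are assumptions; Galaxy's dynamic job rule mechanism differs in detail across versions.

```python
# Hedged sketch of a dynamic job rule: route large assembly jobs to a remote
# HPC destination, everything else to the local cluster. Destination names
# and the input-size heuristic are illustrative assumptions only.
def choose_destination(job):
    total_input_bytes = sum(
        d.dataset.get_size() for d in job.input_datasets if d.dataset
    )
    # Assemblies with more than ~100 GB of input go to the large
    # shared-memory machine reached over SSH.
    if total_input_bytes > 100 * 1024**3:
        return "psc_blacklight_ssh"
    return "local_cluster"
```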
What Remains to Be Done (2)
Galaxy-controlled data management: intelligently and efficiently migrate and use data on distributed compute resources
Testing various data migration strategies with SLASH2 and other available technologies
Further developing SLASH2 to meet federated Galaxy requirements through a recent NSF DIBBs award at PSC
Authentication with Galaxy instances using XSEDE or other credentials, e.g., InCommon/CILogon (see the upcoming talk by Indiana)
Additional data transfer capabilities in Galaxy, such as iRODS and Globus Online (see the upcoming talk on Globus Genomics)
Eventually: Use These Technologies to Enable Universal Federation
Appendix
Initial Galaxy Data Staging to PSC
Underlying SLASH2 Architecture
Initial Galaxy Data Staging to PSC
Transferred 470 TB in 21 days from PSU to PSC (average ~22 TB/day; peak 40 TB/day)
rsync used for the initial staging and to synchronize subsequent updates
Data copy maintained at PSC in the /arc filesystem, available from compute nodes
[diagram: data generation nodes and storage at Penn State connected to the PSC Data Supercell over a 10 GigE link]
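A minimal sketch of the repeated synchronization step described above, wrapped in Python; the source and destination paths and host are hypothetical placeholders, and the rsync options shown are just one reasonable choice.

```python
# Sketch: keep the remote /arc copy in sync with the Galaxy datasets tree
# by re-running rsync. Paths and host are hypothetical placeholders.
import subprocess

def sync_datasets(src="/galaxy/files/", dest="psc-dtn.example:/arc/galaxy/files/"):
    # -a preserves attributes, --partial resumes interrupted files,
    # --delete mirrors removals so the copies stay consistent.
    subprocess.run(["rsync", "-a", "--partial", "--delete", src, dest], check=True)

if __name__ == "__main__":
    sync_datasets()
```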
Underlying SLASH2 Architecture
Metadata server (MDS): one at Galaxy Main and one at PSC for performance; converts pathnames to object IDs; schedules updates when copies become inconsistent; consistency protocol to avoid incoherent data; residency and network scheduling policies enforced
Clients: compute resources and dedicated front ends; handle all other file operations (RENAME, SYMLINK, etc.); dataset residency requests issued by administrators and/or users
I/O servers (IOS): handle READ and WRITE I/O; very lightweight; can use most backing filesystems (ZFS, ext4, etc.)
Funded by the National Science Foundation
1. Large memory clusters for assembly
2. Bioinformatics consulting for biologists
3. Optimized software for better efficiency
Collaboration across IU, TACC, SDSC, and PSC
Open for business at: http://ncgas.org
Making It Easier for Biologists
[diagram: computational skills axis, from common/low to rare/high]
Web interface to NCGAS resources
Supports many bioinformatics tools
Available for both research and instruction
GALAXY.NCGAS.ORG Model
Virtual box hosting galaxy.ncgas.org; the host for each tool is configured individually
NCGAS establishes tools, hardens them, and moves them into production
Individual projects can get duplicate boxes provided they support them themselves
Backed by Quarry, Mason, archive storage, and the Data Capacitor
Moving Forward
[diagram: "Your Friendly Neighborhood Sequencing Center" sites connect over 10 and 100 Gbps links, via Globus Online and other tools, to NCGAS resources]
NCGAS Mason (free for NSF users)
Data Capacitor: Lustre WAN filesystem, no data storage charges
Other NCGAS XSEDE resources
IU POD (12 cents per core hour)
Optimized software
NCGAS Galaxy Usage: 2013
[chart: core hours per month, January through November 2013, ranging up to roughly 4,500]
CILogon Authentication for Galaxy Dec. 17, 2013
Goals and Approaches
NCGAS authentication requirements: XSEDE users can authenticate with NCGAS Galaxy through InCommon credentials; only NCGAS-authorized users can authenticate and use the resource.
The CILogon service (http://www.cilogon.org) allows users to authenticate with their home organization and obtain a certificate for secure access to cyberinfrastructure. It supports the MyProxy OAuth protocol for certificate delegation, enabling science gateways to access CI on a user's behalf.
Approach: incorporate CILogon as external user authentication for Galaxy, with a home-brewed simple authorization mechanism.
Technical Challenges
The CILogon OAuth client implementation is Java, while Galaxy is Python; Python lacks full-featured OAuth libraries supporting the RSA-SHA1 signature method required by CILogon's OAuth interface.
Once authenticated through CILogon, the remote username needs to be forwarded to Galaxy via the Apache proxy.
Additional authorization is required for CILogon-authenticated users.
Some of the default CILogon IdPs, including OpenID providers (Google, PayPal, VeriSign), are not desired.
Architecture
[diagram: authentication flow through the Apache web server, the HTTP_COOKIE rewrite, and the PHP CILogon OAuth client]
Technical Highlights
PHP (non-Java) implementation of the CILogon OAuth client
Configure the Apache proxy to Galaxy:
Enable Galaxy external user authentication (universe_wsgi.ini)
Configure Apache for proxy forwarding (httpd-ssl.conf)
Configure Apache for CILogon authentication with an HTTP_COOKIE rewrite (httpd-ssl.conf)
Customized NCGAS skin limiting IdPs to InCommon academic institutions
PHP implementation of simple file-based user authorization
Lightweight, packaged for general Galaxy installation
Open source; more details at http://sourceforge.net/p/ogce/svn/head/tree/galaxy/
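The authorization layer described above was implemented in PHP; purely as an illustration of the same file-based allow-list idea, here is a hedged Python sketch keyed on the remote username forwarded by Apache. The file path, username normalization, and fail-closed behavior are assumptions, not details from the talk.

```python
# Illustrative Python equivalent of the simple file-based authorization idea:
# permit the request only if the CILogon-derived remote user appears in a flat
# allow-list file. Path and normalization are placeholder assumptions.
AUTHORIZED_USERS_FILE = "/etc/galaxy/authorized_users.txt"

def is_authorized(remote_user: str) -> bool:
    try:
        with open(AUTHORIZED_USERS_FILE) as fh:
            allowed = {line.strip().lower() for line in fh if line.strip()}
    except FileNotFoundError:
        return False  # fail closed if the allow-list is missing
    return remote_user.strip().lower() in allowed
```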
Demo https://galaxy.ncgas.org
Experiences in Building a Next-Generation Sequencing Analysis Service Using Galaxy, Globus, and Amazon Web Services
Ravi K. Madduri, Argonne National Laboratory and University of Chicago
Globus Genomics Architecture
Globus Genomics Solution Description
Integrated identity management, group management, and data movement using Globus
Computational profiles for various analysis tools
Resources can be provisioned on demand with Amazon Web Services cloud-based infrastructure
GlusterFS as a shared filesystem between head nodes and compute nodes
Provisioned IOPS on EBS
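A hedged sketch of the on-demand provisioning idea using boto3 (which postdates this talk): launch a compute node and create an EBS volume with provisioned IOPS. The AMI ID, region, instance type, volume size, and IOPS value are placeholder assumptions, not the actual Globus Genomics configuration.

```python
# Sketch only: on-demand provisioning of a compute node plus a
# provisioned-IOPS EBS volume, in the spirit of the architecture above.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Hypothetical worker-node image and instance type.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="c5.4xlarge",
    MinCount=1,
    MaxCount=1,
)

# Scratch volume for analysis data, with provisioned IOPS as described above.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,            # GiB
    VolumeType="io1",    # provisioned-IOPS volume type
    Iops=4000,
)
```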
Globus Genomics Usage
Example User: Cox Lab
Affiliations: Computation Institute, University of Chicago, Chicago, IL, USA; Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA; Section of Genetic Medicine, University of Chicago, Chicago, IL
[poster panels: Challenges in Next-Gen Sequencing Analysis; Parallel Workflows on Globus Genomics; High Performance, Reusable Consensus]
Globus Genomics Pricing
Acknowledgments
This work was supported in part by the NIH through the NHLBI grant The Cardiovascular Research Grid (R24HL085343) and by the U.S. Department of Energy under contract DE-AC02-06CH11357. We are grateful to Amazon, Inc. for an award of Amazon Web Services time that facilitated early experiments.
The Globus Genomics and Globus Online teams at the University of Chicago and Argonne National Laboratory
For More Information
More information on Globus Genomics and to sign up: www.globus.org/genomics
More information on Globus Online: www.globus.org
Questions? Thank you!