Chris Dwan - Bioteam

Similar documents
Trends and Observations, October Christopher Dwan

RESEARCH DATA DEPOT AT PURDUE UNIVERSITY

Data Centers and Cloud Computing

Data Centers and Cloud Computing. Slides courtesy of Tim Wood

The Future of Virtualization: The Virtualization of the Future

Data Centers and Cloud Computing. Data Centers

Best Practices for designing Farms and Clusters

Cloud Computing. What is cloud computing. CS 537 Fall 2017

High Performance and Cloud Computing (HPCC) for Bioinformatics

Introduction to MapReduce

Renovating your storage infrastructure for Cloud era

Approaching the Petabyte Analytic Database: What I learned

BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE

Scientific Workflows and Cloud Computing. Gideon Juve USC Information Sciences Institute

Next-Generation Cloud Platform

Compact Muon Solenoid: Cyberinfrastructure Solutions. Ken Bloom UNL Cyberinfrastructure Workshop -- August 15, 2005

Low-cost BYO Mass Storage Project

Cloud computing and Citrix C3 Last update 28 July 2009 Feedback welcome -- Michael. IT in the Clouds. Dr Michael Harries Citrix Labs

Case Study: Aurea Software Goes Beyond the Limits of Amazon EBS to Run 200 Kubernetes Stateful Pods Per Host

TCO REPORT. NAS File Tiering. Economic advantages of enterprise file management

High Performance and Cloud Computing (HPCC) for Bioinformatics

HPC Growing Pains. IT Lessons Learned from the Biomedical Data Deluge

The Desired State. Solving the Data Center s N-Dimensional Challenge

Top Trends in DBMS & DW

Demystifying the Cloud With a Look at Hybrid Hosting and OpenStack

Flash Decisions: Which Solution is Right for You?

Amazon Web Services. For Government, Education, and Nonprofit Organizations

4/28/2014. Defining A Replacement Cycle for Your Association. Introductions. Introductions. April Executive Director, Idealware. Idealware.

Don t Run out of Power: Use Smart Grid and Cloud Technology

Practical MySQL Performance Optimization. Peter Zaitsev, CEO, Percona July 02, 2015 Percona Technical Webinars

Public Cloud Leverage For IT/Business Alignment Business Goals Agility to speed time to market, adapt to market demands Elasticity to meet demand whil

Windows Servers In Microsoft Azure

Imperial College London. Simon Burbidge 29 Sept 2016

Embedded Technosolutions

Simplifying the Datacenter with Hyper-Convergence. Bob O Donnell, Founder and Chief Analyst

CS 6240: Parallel Data Processing in MapReduce: Module 1. Mirek Riedewald

E-Guide WHAT WINDOWS 10 ADOPTION MEANS FOR IT

Cloud Computing Briefing Presentation. DANU

Intro to Software as a Service (SaaS) and Cloud Computing

Foundations of Data Engineering

The Case for Virtualizing Your Oracle Database Deployment

HPC learning using Cloud infrastructure

Banking Technology Trends

Most likely, your organization is not in the business of running data centers, yet a significant amount of time & money is spent doing just that.

MySQL in the Cloud Tricks and Tradeoffs

Architecting for Data Center Agility. Ted Ritter Senior Research Analyst

2016 All Rights Reserved

I'm Andy Glover and this is the Java Technical Series of. the developerworks podcasts. My guest is Brian Jakovich. He is the

Infrastructure Requirements for Discovery Research. Chris Dagdigian 2009 PRISM Forum

Cloud Bursting: Top Reasons Your Organization will Benefit. Scott Jeschonek Director of Cloud Products Avere Systems

Managing CAE Simulation Workloads in Cluster Environments

Breaking Through the Cloud: A LABORATORY GUIDE TO CLOUD COMPUTING

Office 365 Business The Microsoft Office you know, powered by the cloud.

High Performance Computing Data Management. Philippe Trautmann BDM High Performance Computing Global Research

Defining Software Defined Storage Laz Vekiarides Entrepreneur and Technology Executive

Part 2: Computing and Networking Capacity (for research and instructional activities)

Large-Scale Data Engineering. Overview and Introduction

SimpliVity Best of Both Worlds

YOUR CONDUIT TO THE CLOUD

CSD3 The Cambridge Service for Data Driven Discovery. A New National HPC Service for Data Intensive science

CLOUD COMPUTING ARTICLE. Submitted by: M. Rehan Asghar BSSE Faizan Ali Khan BSSE Ahmed Sharafat BSSE

Technology Challenges for Clouds. Henry Newman CTO Instrumental Inc

Bringing OpenStack to the Enterprise. An enterprise-class solution ensures you get the required performance, reliability, and security

EMC ISILON HARDWARE PLATFORM

How to Pick SQL Server Hardware

REBIT PARTNER PROGRAM. Overview and Guide

Understanding Managed Services

You Can t Move Forward Unless You Can Roll Back. By: Michael Black

Provisioning with SUSE Enterprise Storage. Nyers Gábor Trainer &

SKA. data processing, storage, distribution. Jean-Marc Denis Head of Strategy, BigData & HPC. Oct. 16th, 2017 Paris. Coyright Atos 2017

High Performance Oracle Database in a Flash Sumeet Bansal, Principal Solutions Architect

Performance Is Not a Commodity Nathan Day Chief Scientist and Co - Founder

Windows 2000 Flavors Windows 200 ws 0 Profess 0 P ional Windows 2000 Server Windows 200 ws 0 Advan 0 A ced Server Windows 2000 Datacen ter Server 2

Interface Trends for the Enterprise I/O Highway

Desktop Virtualization: What Windows Managers Should Know

White Paper. Low Cost High Availability Clustering for the Enterprise. Jointly published by Winchester Systems Inc. and Red Hat Inc.

Get Smart about Backup & Recovery

Marketing Lens Marketing Lens Fast Track Implementation Plan Marketing

A Robust, Flexible Platform for Expanding Your Storage without Limits

IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads

Distributed System. Gang Wu. Spring,2019

Business Benefits of Policy Based Data De-Duplication Data Footprint Reduction with Quality of Service (QoS) for Data Protection

SOUTH DAKOTA BOARD OF REGENTS HOUSING AND AUXILIARY FACILITIES SYSTEM REVENUE BONDS, SERIES 2015 Due Diligence Document Request List and Questionnaire

SEMINAR. Achieve 100% Backup Success! Achieve 100% Backup Success! Today s Goals. Today s Goals

The Best of Times/The Worst of Times

Field Update Expanded Deduplication Sizing Guidelines. Oct 2015

ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION

LEVERAGING A PERSISTENT HARDWARE ARCHITECTURE

Hva er den reelle business-verdien ved software definert datasenter (SDDC) Kjell-Einar Anderssen Country Manager - Norway

ECONOMICAL, STORAGE PURPOSE-BUILT FOR THE EMERGING DATA CENTERS. By George Crump

Amazon Elastic Cloud (EC2): An Easy Route to Your Own Dedicated Webserver

AtoS IT Solutions and Services. Microsoft Solutions Summit 2012

De BiG Grid e-infrastructuur digitaal onderzoek verbonden

Plenary Keynote. Chris Dagdigian. Bio IT World CONFERENCE & EXPO 09

EMC ACADEMIC ALLIANCE

Protected Environment at CHPC. Sean Igo Center for High Performance Computing September 11, 2014

Server Virtualization and Optimization at HSBC. John Gibson Chief Technical Specialist HSBC Bank plc

CREATIVITY MAKES THE DIFFERENCE

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING

Linux Clusters for High- Performance Computing: An Introduction

Transcription:

Chris Dwan - Bioteam Scientists with production HPC skills Bridging the gap between informatics & IT Vendor & technology agnostic A resource for labs and workgroups that don t have their own supercomputing centers and IT empires Various levels of engagement with many clients Gov/EDU/Biotech/Pharma/Fortune-20 clients Work with lots of smart people on common problems Tutorial at Edinburgh in 2002 good to be back.

Disclaimer Most BioTeam clients don t have 7 figure IT budgets, Petabyte SANs and dedicated datacenters Most of these problems: Simply don t exist for the largest Bio-HPC centers Simply don t matter to the nationally funded Grid projects.

Capital G GRID Computing Remember the promise? Utility computing! Like turning on a tap! Multi-site? No problem! Multi-entity? No problem! Infinite capacity on demand! GRID Facts in 2008: Still a trainwreck for all but the showcase sites At least the vendor FUD & empty press releases have died down Only a tiny number of showpiece sites have the resources to do GRID computing for real

Observed Trends: Clusters: The small cluster market is going away 2-8 node workgroup/lab clusters will be replaced by SMP boxes with multi-core CPUs Storage: Same in 2008 as in 2006 Unhappy technology tradeoffs Backups: The exotic vendors offer blazing speed and a few features The mainstream vendors exclusively focused on enterprise What I need: Massive scaling, decent speed & grab bag of enterprise features 2006: Backup products not keeping up with daily advances in storage capacity promoted by vendors 2008: On its way to becoming a sick joke

Observed Trends: Software Molecular chemists / CFD folks / single purpose shops know what they want, and how to buy it. Despite my best efforts, BLAST is still the state of the art for lots of people. Lots of demand in 2007 for single purpose systems designed to run Phylogeny codes PAUP, MrBayes

In-row chilling 1024 Core cluster @ Emory University

Sealed hot/cold isle enclosures

Liebert XDO Overhead Cooling Site: Institute for Computational Biomedicine; Wiell Cornell Medical College

Next Generation Sequencing New chemistry: Removes the read length limitation. $10 6 to buy the instrument Each run : 5 x 10 8 base pairs 2-3 days Less than $7,000 Up to 2TB of raw data QuickTime and a TIFF (Uncompressed) decompressor are needed to see this picture. 4 vendors, this is 1st revision.

Terrifying: Terabyte Instruments Is this your future? Multi-terabyte storage resources in every wet lab? Tough decisions ahead Centralized vs. decentralized data capture & movement This will effect everyone doing HPC Bio IT DO NOT WANT!

The Data Problem Primary instrument data (images): ~2TB / day Get used to peta, then exa. Cheaper to re-sequence than to store? Even if you can store them, TB are still heavy Sneakernet lives. QuickTime and a TIFF (Uncompressed) decompressor are needed to see this picture. Sequence and quality data: 100 MB / day (manageable) Analysis will have to be local, and accessible to the lab scientists. What will NCBI do?

Next-gen Sequencing Vendors want lock-in, scientists do not Initial data processing may keep up, re-processing will not C.f: Microarrays This is the first generation. Core assumptions are still fluid. Opens the HPC market to a whole bunch of new users (physicians, security, environmental monitoring, )

Wikilims: Next Gen Data Store Meta data wiki File data raid

Automatic data capture - R Most structured content can be captured and recorded by programs as it is generated

Launch an assembly Launch an assembly on the cluster

Potential trend: Data Triage In 2007 first decisions to not store primary data In the past Always keep all data, essentially forever Excuses: It costs to much to repeat the experiment Experiment can t be repeated (imaging, microscopy) It s just too horrible to think about Moving forward (2008 and beyond) Expect cost/benefit discussions among IT and scientific staff What data really needs to be kept? (Primary vs. Derived data) In what cases is it actually be cheaper to rerun the experiment? MAID - Massive Array of Idle Disks

Amazon EC2 for Bio Apps Thanks Dr. Papadopoulos! BioTeam is: Enthusiastic about Amazon EC2 & S3 Considering building EC2-aware products Desktop, cluster & standalone GUIs BioTeam has: MPIBLAST running on EC2 MrBayes-MPI running on EC2 Cross platform GUIs for both applications Generic Sun Grid Engine EC2 images (in development) Storage for both coming from within Amazon S3 Many more apps on the way

Amazon EC2 for Bio Apps Why EC2? The economics are compelling One month of serious experimentation: $9.00 USD billed to credit card Various money making approaches Flexible pricing allows reselling & revenue sharing I can create a EC2 image and add my own fees on top to cover development and support costs I don t need your credit card Amazon handles all transactions & billing

runblast.com

Conclusions Data is the problem Storage and Backups are not keeping pace with requirements This is forcing scientists to revisit core assumptions CPUs are not as big a problem Clusters are commodity Cloud augments one-off needs. Smaller labs are rolling their own solutions As usual, the biggest problems are social / political.