Extreme scripting and other adventures in data-intensive computing

1 Extreme scripting and other adventures in data-intensive computing
Ian Foster
Allan Espinosa, Ioan Raicu, Mike Wilde, Zhao Zhang
Computation Institute, Argonne National Lab & University of Chicago

2 How data analysis happens at data-intensive computing workshops

3 How data analysis really happens in scientific laboratories
% foo file1 > file2
% bar file2 > file3
% foo file1 | bar > file3
% foreach f (f1 f2 f3 f4 f5 f6 f7 ... f100)
foreach? foo $f.in | bar > $f.out
foreach? end
%
% Now where on earth is f98.out, and how did I generate it again?
Now: command not found.
%
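The lost f98.out in the session above is a provenance problem: the loop ran, but nothing recorded which command produced which file. A minimal sketch of the idea follows, assuming the slide's placeholder programs foo and bar; the function names and the record layout are hypothetical and only illustrate keeping a command log next to the outputs. It does not execute anything.

```python
def plan_pipeline(stems, first="foo", second="bar"):
    """Build one provenance record per input stem (nothing is executed)."""
    records = []
    for stem in stems:
        infile, outfile = f"{stem}.in", f"{stem}.out"
        records.append({
            "output": outfile,
            "inputs": [infile],
            # The same pipeline the csh loop would run for this stem.
            "command": f"{first} {infile} | {second} > {outfile}",
        })
    return records


def how_was_it_made(records, output):
    """Return the command that produced `output`, or None if unknown."""
    for rec in records:
        if rec["output"] == output:
            return rec["command"]
    return None


stems = [f"f{i}" for i in range(1, 101)]  # f1 .. f100, as in the loop
log = plan_pipeline(stems)
print(how_was_it_made(log, "f98.out"))
```

With such a log, the question "how did I generate it again?" becomes a lookup rather than a search through shell history.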

4 Extreme scripting
Simple scripts, small computers
Complex scripts: many activities, numerous files, complex data, data dependencies, many programs
Big computers: many processors, storage hierarchy, failure, heterogeneity
Swift: preserving file system semantics, ability to call arbitrary executables

5 Functional magnetic resonance imaging (fMRI) data analysis

6 AIRSN program definition
(Run snr) functional ( Run r, NormAnat a, Air shrink ) {
  Run yrorun = reorientrun( r, "y" );
  Run rorun = reorientrun( yrorun, "x" );
  Volume std = rorun[0];
  Run rndr = random_select( rorun, 0.1 );
  AirVector rndairvec = align_linearrun( rndr, std, 12, 1000, 1000, "81 3 3" );
  Run reslicedrndr = reslicerun( rndr, rndairvec, "o", "k" );
  Volume meanrand = softmean( reslicedrndr, "y", "null" );
  Air mnqaair = alignlinear( a.nhires, meanrand, 6, 1000, 4, "81 3 3" );
  Warp boldnormwarp = combinewarp( shrink, a.awarp, mnqaair );
  Run nr = reslice_warp_run( boldnormwarp, rorun );
  Volume meanall = strictmean( nr, "y", "null" );
  Volume boldmask = binarize( meanall, "y" );
  snr = gsmoothrun( nr, boldmask, "6 6 6" );
}

// Reorient every volume of a run; each iteration is independent of the others.
(Run or) reorientrun ( Run ir, string direction ) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient( iv, direction );
  }
}

7 Many many tasks: Identifying potential drug targets
Protein target(s) x 2M+ ligands (Benoit Roux et al.)

8 [Workflow diagram: screening ligands against one protein target]
Inputs: PDB protein descriptions, 1 protein (1MB); ZINC 3-D structures, 2M structures (6 GB)
Manually prep DOCK6 rec file: DOCK6 Receptor (1 per protein: defines pocket to bind to)
Manually prep FRED rec file: FRED Receptor (1 per protein: defines pocket to bind to)
NAB Script Template and NAB script parameters (defines flexible residues, #MDsteps): BuildNABScript, NAB Script
FRED, DOCK6: ~4M x 60s x 1 cpu, ~60K cpu-hrs; select best ~5K from each
Amber: ~10K x 20m x 1 cpu, ~3K cpu-hrs; select best ~500
Amber prep: 2. AmberizeReceptor, 4. perl: gen nabscript
Amber Score: 1. AmberizeLigand, 3. AmberizeComplex, 5. RunNABScript
GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs
Diagram labels: start, ligands, complexes, report, end
For 1 target: 4 million tasks, 500,000 cpu-hrs (50 cpu-years)
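The per-stage costs on this slide are simple products of task count, task duration, and CPUs per task. The sketch below reproduces that arithmetic from the slide's own figures; the stage labels and the helper name cpu_hours are illustrative choices, and a 365-day year is assumed for the conversion to cpu-years. The slide's totals are rounded, so the computed values come out somewhat higher than the quoted ~60K, ~3K, and 500,000 cpu-hrs.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year


def cpu_hours(tasks, seconds_per_task, cpus_per_task=1):
    """CPU-hours consumed by `tasks` tasks of the given length and width."""
    return tasks * seconds_per_task * cpus_per_task / SECONDS_PER_HOUR


# Figures taken from the slide: tasks x duration x CPUs per task.
stages = {
    "FRED/DOCK6": cpu_hours(4_000_000, 60),
    "Amber": cpu_hours(10_000, 20 * 60),
    "GCMC": cpu_hours(500, 10 * 3600, cpus_per_task=100),
}

total = sum(stages.values())
for name, hours in stages.items():
    print(f"{name}: {hours:,.0f} cpu-hrs")
print(f"total: {total:,.0f} cpu-hrs, about {total / HOURS_PER_YEAR:.0f} cpu-years")
```

The GCMC stage dominates: it has the fewest tasks but each one occupies 100 CPUs for 10 hours.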

10 IBM BG/P: 570 Teraflop/s, 164,000 cores, 80 TB

11 DOCK on BG/P: ~1M tasks on 119,000 CPUs (Ioan Raicu et al.)
[Plot: cores and tasks vs. time (sec)]
Elapsed time: 7257 sec
Compute time: CPU years
Average task: 667 sec
Relative efficiency: 99.7% (from 16 to 32 racks)
Utilization: 99.6% sustained, 78.3% overall
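The overall utilization figure can be sanity-checked from the other numbers on the slide. The sketch below assumes that overall utilization means busy core-seconds divided by available core-seconds over the whole run; that definition is an assumption, not something the slide states. It also uses the rounded inputs (~1M tasks, 119,000 CPUs), so the result only approximates the reported 78.3%.

```python
def overall_utilization(tasks, avg_task_sec, cores, elapsed_sec):
    """Fraction of available core-seconds spent running tasks (assumed definition)."""
    busy = tasks * avg_task_sec
    available = cores * elapsed_sec
    return busy / available


# Rounded figures from the slide; the exact task and core counts are not given.
estimate = overall_utilization(
    tasks=1_000_000, avg_task_sec=667, cores=119_000, elapsed_sec=7257
)
print(f"estimated overall utilization: {estimate:.1%}")
```

The gap between sustained and overall utilization reflects ramp-up and tail effects: cores sit idle while the first tasks are dispatched and while the last long tasks finish.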

12 Managing 160,000 cores
Falkon; high-speed local disk; slower shared storage

13 Scaling POSIX to petascale
Global: large dataset on the global file system; staging via Chirp (multicast)
Intermediate: CN-striped intermediate file system (IFS) with MosaStore (striping); ZOID on I/O node; IFS segments on IFS compute nodes; torus and tree interconnects
Local: LFS on each compute node (local datasets)

14 Efficiency for 4-second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors

15 Provisioning for data-intensive workloads
Example: on-demand stacking of arbitrary locations within ~10TB sky survey
Challenges: random data access, much computing, time-varying load
Solution: dynamic acquisition of compute & storage; data diffusion
S = Sloan Data
Ioan Raicu

16 Sine workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node (Ioan Raicu)

17 Same scenario, but with dynamic resource provisioning

18 Data diffusion sine-wave workload: Summary
GPFS: 5.70 hrs, ~8Gb/s, 1138 CPU hrs
DD+SRP: 1.80 hrs, ~25Gb/s, 361 CPU hrs
DD+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs
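The summary figures are easier to compare as ratios against the GPFS baseline. The sketch below uses only the numbers from this slide; the dictionary layout and the function name are illustrative.

```python
# Figures from the slide: wall-clock hours, approximate throughput, CPU hours.
results = {
    "GPFS": {"hours": 5.70, "gbps": 8, "cpu_hours": 1138},
    "DD+SRP": {"hours": 1.80, "gbps": 25, "cpu_hours": 361},
    "DD+DRP": {"hours": 1.86, "gbps": 24, "cpu_hours": 253},
}


def relative_to_baseline(results, baseline="GPFS"):
    """Speedup and CPU-hour reduction of each configuration versus the baseline."""
    base = results[baseline]
    return {
        name: {
            "speedup": base["hours"] / r["hours"],
            "cpu_saving": base["cpu_hours"] / r["cpu_hours"],
        }
        for name, r in results.items()
        if name != baseline
    }


ratios = relative_to_baseline(results)
for name, r in ratios.items():
    print(f"{name}: {r['speedup']:.2f}x faster, {r['cpu_saving']:.2f}x fewer CPU hours")
```

Read this way, dynamic provisioning (DD+DRP) finishes in nearly the same time as static provisioning (DD+SRP) while consuming noticeably fewer CPU hours.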

19 Data-intensive Computation Institute: Example applications
Astrophysics, cognitive science, East Asian studies, economics, environmental science, epidemiology, genomic medicine, neuroscience, political science, sociology, solid state physics

21 Sequencing outpaces Moore's law (Folker Meyer, Computation Institute)
[Chart: cost in US$ of bioinformatics (BLAST on EC2) vs. sequencing (Solexa, next-gen Solexa), plotted against gigabases]

22 Data-intensive Computation Institute: Hardware
PADS: Petascale Active Data Store (NSF MRI)
Diverse data sources; data ingest
1000 TB tape backup; 500 TB reliable storage (data & metadata)
180 TB, 180 GB/s; 17 Top/s analysis
Dynamic provisioning; parallel analysis; remote access
Diverse users; offload to remote data centers

23 Data-intensive Computation Institute: Software
HPC systems software (MPICH, PVFS, ZeptoOS)
Collaborative data tagging (GLOSS)
Data integration (XDTM)
HPC data analytics and visualization
Loosely coupled parallelism (Swift, Hadoop)
Dynamic provisioning (Falkon)
Service authoring (Introduce, caGrid, gRAVI)
Provenance recording and query (Swift)
Service composition and workflow (Taverna)
Virtualization management (Workspace Service)
Distributed data management (GridFTP, etc.)

24 Data-intensive computing is an end-to-end problem
[Diagram: agreement about outcomes (low to high) vs. certainty about outcomes (low to high), with regions labeled chaos, zone of complexity, and plan and control. Source: Ralph Stacey, Complexity and Creativity in Organizations, 1996]

25 We need to function in the zone of complexity
[Diagram: agreement about outcomes (low to high) vs. certainty about outcomes (low to high), with regions labeled chaos and plan and control. Source: Ralph Stacey, Complexity and Creativity in Organizations, 1996]

26 The Grid paradigm
Principles and mechanisms for dynamic virtual organizations
Leverage service oriented architecture
Loose coupling of data and services
Open software, architecture
Disciplines: computer science, physics, astronomy, engineering, biology, biomedicine, healthcare

27 As of Oct 19, 2008: 122 participants, 105 services (70 data, 35 analytical)

28 Multi-center clinical cancer trials image capture and review (Center for Health Informatics)

29 Summary
Extreme scripting offers the potential for easy scaling of proven working practices
Interesting technical problems relating to programming and I/O models
Many wonderful applications
Data-intensive computing is an end-to-end problem
Data generation, integration, analysis, etc., is a continuous, loosely coupled process

30 Thank you! Computation Institute


DDN About Us Solving Large Enterprise and Web Scale Challenges 1 DDN About Us Solving Large Enterprise and Web Scale Challenges History Founded in 98 World s Largest Private Storage Company Growing, Profitable, Self Funded Headquarters: Santa Clara and Chatsworth,

More information

PERFORMANCE ANALYSIS AND OPTIMIZATION OF MULTI-CLOUD COMPUITNG FOR LOOSLY COUPLED MTC APPLICATIONS

PERFORMANCE ANALYSIS AND OPTIMIZATION OF MULTI-CLOUD COMPUITNG FOR LOOSLY COUPLED MTC APPLICATIONS PERFORMANCE ANALYSIS AND OPTIMIZATION OF MULTI-CLOUD COMPUITNG FOR LOOSLY COUPLED MTC APPLICATIONS V. Prasathkumar, P. Jeevitha Assiatant Professor, Department of Information Technology Sri Shakthi Institute

More information

Problems for Resource Brokering in Large and Dynamic Grid Environments

Problems for Resource Brokering in Large and Dynamic Grid Environments Problems for Resource Brokering in Large and Dynamic Grid Environments Cătălin L. Dumitrescu Computer Science Department The University of Chicago cldumitr@cs.uchicago.edu (currently at TU Delft) Kindly

More information

SAS workload performance improvements with IBM XIV Storage System Gen3

SAS workload performance improvements with IBM XIV Storage System Gen3 SAS workload performance improvements with IBM XIV Storage System Gen3 Including performance comparison with XIV second-generation model Narayana Pattipati IBM Systems and Technology Group ISV Enablement

More information

Accelerating Large Scale Scientific Exploration through Data Diffusion

Accelerating Large Scale Scientific Exploration through Data Diffusion Accelerating Large Scale Scientific Exploration through Data Diffusion Ioan Raicu *, Yong Zhao *, Ian Foster #*+, Alex Szalay - {iraicu,yongzh }@cs.uchicago.edu, foster@mcs.anl.gov, szalay@jhu.edu * Department

More information

Flash Storage Complementing a Data Lake for Real-Time Insight

Flash Storage Complementing a Data Lake for Real-Time Insight Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum

More information

Parallel and Distributed File Systems

Parallel and Distributed File Systems CSE 710 Seminar Parallel and Distributed File Systems Tevfik Kosar, Ph.D. Week 1: January 29, 2014 Data Deluge Big Data in Science Scientific data outpaced Moore s Law! Demand for data brings demand for

More information

Professor: Ioan Raicu. TA: Wei Tang. Everyone else

Professor: Ioan Raicu. TA: Wei Tang. Everyone else Professor: Ioan Raicu http://www.cs.iit.edu/~iraicu/ http://datasys.cs.iit.edu/ TA: Wei Tang http://mypages.iit.edu/~wtang6/ Everyone else Background? What do you want to get out of this course? 2 General

More information

Commercial Data Intensive Cloud Computing Architecture: A Decision Support Framework

Commercial Data Intensive Cloud Computing Architecture: A Decision Support Framework Association for Information Systems AIS Electronic Library (AISeL) CONF-IRM 2014 Proceedings International Conference on Information Resources Management (CONF-IRM) 2014 Commercial Data Intensive Cloud

More information

THE conventional architecture of high-performance

THE conventional architecture of high-performance 1 Towards Exploring Data-Intensive Scientific Applications at Extreme Scales through Systems and Simulations Dongfang Zhao, Ning Liu, Dries Kimpe, Robert Ross, Xian-He Sun, and Ioan Raicu Abstract The

More information

AUTOMATING IBM SPECTRUM SCALE CLUSTER BUILDS IN AWS PROOF OF CONCEPT

AUTOMATING IBM SPECTRUM SCALE CLUSTER BUILDS IN AWS PROOF OF CONCEPT AUTOMATING IBM SPECTRUM SCALE CLUSTER BUILDS IN AWS PROOF OF CONCEPT By Joshua Kwedar Sr. Systems Engineer By Steve Horan Cloud Architect ATS Innovation Center, Malvern, PA Dates: Oct December 2017 INTRODUCTION

More information

Mathematics and Computer Science Division. Department of Agricultural and Biological Engineering

Mathematics and Computer Science Division. Department of Agricultural and Biological Engineering Mathematics and Computer Science Division Department of Science and Technologies University of Naples Parthenope FACE-IT: Earth science workflows made easy with Globus and Galaxy technologies (Provide

More information

Deep Learning mit PowerAI - Ein Überblick

Deep Learning mit PowerAI - Ein Überblick Stephen Lutz Deep Learning mit PowerAI - Open Group Master Certified IT Specialist Technical Sales IBM Cognitive Infrastructure IBM Germany Ein Überblick Stephen.Lutz@de.ibm.com What s that? and what s

More information

Educating a New Breed of Data Scientists for Scientific Data Management

Educating a New Breed of Data Scientists for Scientific Data Management Educating a New Breed of Data Scientists for Scientific Data Management Jian Qin School of Information Studies Syracuse University Microsoft escience Workshop, Chicago, October 9, 2012 Talk points Data

More information

Chapter 1: Introduction to Parallel Computing

Chapter 1: Introduction to Parallel Computing Parallel and Distributed Computing Chapter 1: Introduction to Parallel Computing Jun Zhang Laboratory for High Performance Computing & Computer Simulation Department of Computer Science University of Kentucky

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Data Life Cycle. Research. Access Collaborate. Acquire. Analyse. Comprehend. Plan. Manage Archive. Publish Reuse

Data Life Cycle. Research. Access Collaborate. Acquire. Analyse. Comprehend. Plan. Manage Archive. Publish Reuse Automated ingest and management Access Collaborate Dataset transfer Databases Web-based file sharing Collaborative sites Acquire Analyse Technical advice Costing Grant assistance Plan Research Data Life

More information

Data Centres in the Virtual Observatory Age

Data Centres in the Virtual Observatory Age Data Centres in the Virtual Observatory Age David Schade Canadian Astronomy Data Centre A few things I ve learned in the past two days There exist serious efforts at Long-Term Data Preservation Alliance

More information

Magellan Project. Jeff Broughton NERSC Systems Department Head October 7, 2009

Magellan Project. Jeff Broughton NERSC Systems Department Head October 7, 2009 Magellan Project Jeff Broughton NERSC Systems Department Head October 7, 2009 1 Magellan Background National Energy Research Scientific Computing Center (NERSC) Argonne Leadership Computing Facility (ALCF)

More information

IBM Scale Out Network Attached Storage (SONAS) using the Acuo Universal Clinical Platform

IBM Scale Out Network Attached Storage (SONAS) using the Acuo Universal Clinical Platform IBM Scale Out Network Attached Storage (SONAS) using the Acuo Universal Clinical Platform A vendor-neutral medical-archive offering Dave Curzio IBM Systems and Technology Group ISV Enablement February

More information

Smart Trading with Cray Systems: Making Smarter Models + Better Decisions in Algorithmic Trading

Smart Trading with Cray Systems: Making Smarter Models + Better Decisions in Algorithmic Trading Smart Trading with Cray Systems: Making Smarter Models + Better Decisions in Algorithmic Trading Smart Trading with Cray Systems Agenda: Cray Overview Market Trends & Challenges Mitigating Risk with Deeper

More information

Using MPI One-sided Communication to Accelerate Bioinformatics Applications

Using MPI One-sided Communication to Accelerate Bioinformatics Applications Using MPI One-sided Communication to Accelerate Bioinformatics Applications Hao Wang (hwang121@vt.edu) Department of Computer Science, Virginia Tech Next-Generation Sequencing (NGS) Data Analysis NGS Data

More information

Moving e-infrastructure into a new era the FP7 challenge

Moving e-infrastructure into a new era the FP7 challenge GARR Conference 18 May 2006 Moving e-infrastructure into a new era the FP7 challenge Mário Campolargo European Commission - DG INFSO Head of Unit Research Infrastructures Example of e-science challenges

More information

Introduction to High Performance Parallel I/O

Introduction to High Performance Parallel I/O Introduction to High Performance Parallel I/O Richard Gerber Deputy Group Lead NERSC User Services August 30, 2013-1- Some slides from Katie Antypas I/O Needs Getting Bigger All the Time I/O needs growing

More information

A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing

A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou, K. Magoutis, M. Marazakis, A. Bilas Institute of Computer Science (ICS) Foundation for Research and

More information

Life In The Flash Director - EMC Flash Strategy (Cross BU)

Life In The Flash Director - EMC Flash Strategy (Cross BU) 1 Life In The Flash Lane @SamMarraccini, Director - EMC Flash Strategy (Cross BU) CONSTANT 2 Performance = Moore s Law, Or Does It? MOORE S LAW: 100X PER DECADE FLASH Closes The CPU To Storage Gap FLASH

More information

Advanced School in High Performance and GRID Computing November Introduction to Grid computing.

Advanced School in High Performance and GRID Computing November Introduction to Grid computing. 1967-14 Advanced School in High Performance and GRID Computing 3-14 November 2008 Introduction to Grid computing. TAFFONI Giuliano Osservatorio Astronomico di Trieste/INAF Via G.B. Tiepolo 11 34131 Trieste

More information

e-infrastructure: objectives and strategy in FP7

e-infrastructure: objectives and strategy in FP7 "The views expressed in this presentation are those of the author and do not necessarily reflect the views of the European Commission" e-infrastructure: objectives and strategy in FP7 National information

More information