Dac-Man: Data Change Management for Scientific Datasets on HPC systems

1 Dac-Man: Data Change Management for Scientific Datasets on HPC Systems. Devarshi Ghoshal, Lavanya Ramakrishnan, Deborah Agarwal. Lawrence Berkeley National Laboratory.

2 Motivation. Large scientific datasets are frequently updated, often with limited or no provenance. This lengthens the time to scientific discovery: scientists delay updating downstream data products because of the complexity involved. It also leads to inefficient use of compute and storage resources: users often rerun data processing pipelines without understanding the impact of the change. [Figure labels: Data Releases, Storage Resources, Compute Resources, Scientific Discovery]

3 Limitations of Existing Tools. Existing tools compare datasets sequentially; do not save or reuse change information; generate uninterpretable change results; lack the information needed to assess the impact of a data change; cannot quantify change for scientific datasets; and do not scale.

4 Dac-Man: DAta Change MANagement. A framework that identifies, captures, and manages change in large scientific datasets, and enables plug-in of domain-specific change analysis with minimal user effort. It is not a version control system. It is designed to scale on an HPC system and allows users to efficiently interpret and quantify changes.

5 Architecture. User-interfacing components: the change tracker scans and compares data objects and manages data change; the query manager retrieves change information. Change and metadata management: the indexing area stores indexes and filesystem metadata; the caching area stages file and data change information.

6 Dac-Man Indexes. Two indexes are kept: one with the file as key and Hash(file data) as value, and one with Hash(file data) as key and the file as value. This bi-directional indexing helps Dac-Man identify both data and metadata changes in files. MPI workers build the indexes in parallel. The indexes are saved on the filesystem for portability and reuse.
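
The bi-directional index described on this slide can be illustrated with a minimal Python sketch. It is not Dac-Man's implementation: the function names `hash_file` and `build_indexes` are hypothetical, SHA-1 is an assumed hash function, and the walk is serial, whereas the slide says MPI workers build the indexes in parallel.

```python
import hashlib
import os
from collections import defaultdict


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha1()  # assumption: the slide does not name the hash
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_indexes(root):
    """Build a forward (file -> hash) and a reverse (hash -> files) index."""
    forward = {}
    reverse = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root)
            file_hash = hash_file(full_path)
            forward[rel_path] = file_hash
            reverse[file_hash].append(rel_path)
    return forward, dict(reverse)
```

The forward index answers "has the data under this path changed?", while the reverse index answers "does this data still exist under some other path?", which is what makes a rename or move distinguishable from a deletion followed by an addition.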

7 Comparators. File comparator: recursively compares files and directories, including subdirectories and symbolic links; uses indexes to compare the files; classifies file changes into different types (modified, metadata-only, added/deleted). Data comparator: compares data within files. Data adaptors transform different scientific data formats into Dac-Man records, which are collections of key-value pairs. Dac-Man also allows external scripts to be used as data comparators.
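
The file-level classification named on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the Dac-Man code: `classify_changes` is a hypothetical name, each index is assumed to be a plain dict of relative path to content hash, and "metadata-only" is assumed to mean that the same content appears under a path that did not exist in the old dataset.

```python
def classify_changes(old_index, new_index):
    """Classify paths by comparing two path -> content-hash indexes."""
    old_only = set(old_index) - set(new_index)
    new_only = set(new_index) - set(old_index)
    common = set(old_index) & set(new_index)

    # Hashes of content that exists only under paths unique to one side.
    old_only_hashes = {old_index[path] for path in old_only}
    new_only_hashes = {new_index[path] for path in new_only}

    kinds = ("unchanged", "modified", "metadata_only", "added", "deleted")
    changes = {kind: [] for kind in kinds}

    for path in sorted(common):
        same = old_index[path] == new_index[path]
        changes["unchanged" if same else "modified"].append(path)

    for path in sorted(new_only):
        # Same data under a new path: assumed to be a metadata-only change.
        moved = new_index[path] in old_only_hashes
        changes["metadata_only" if moved else "added"].append(path)

    for path in sorted(old_only):
        # Skip paths already reported as metadata-only from the new side.
        if old_index[path] not in new_only_hashes:
            changes["deleted"].append(path)

    return changes
```

Because the comparison only touches the two indexes, no file data is re-read at comparison time in this sketch.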

8 Dac-Man Cache. Similar to a staging area in Git. Saves file change metadata, including subdirectory changes, and is used to speed up subsequent change retrieval queries. Invalid cache entries are updated by re-indexing and recomputing the changes. [Diagram labels: Dataset-1, Dataset-2; Datapath-1, Datapath-2; Cache-id: cached query; Cache entry: Cache-id, change metadata]
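
A rough Python sketch of the caching idea follows. Everything beyond the slide's labels is an assumption: the class name `ChangeCache`, the derivation of the cache-id from the two data paths, and the use of a caller-supplied `fingerprint` to decide when an entry is invalid are all hypothetical; the slide does not say how Dac-Man detects stale entries.

```python
import hashlib


class ChangeCache:
    """In-memory cache mapping a cache-id to saved change metadata."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def cache_id(old_path, new_path):
        """Derive a cache-id from the ordered pair of data paths."""
        key = f"{old_path}\n{new_path}".encode("utf-8")
        return hashlib.sha1(key).hexdigest()

    def lookup(self, old_path, new_path, fingerprint, compute):
        """Return cached changes, recomputing them if the entry is stale.

        fingerprint: any value that changes when either dataset changes
                     (hypothetical validity check, e.g. an index version).
        compute:     callable(old_path, new_path) -> change metadata.
        """
        cid = self.cache_id(old_path, new_path)
        entry = self._entries.get(cid)
        if entry is None or entry["fingerprint"] != fingerprint:
            # Missing or invalid entry: recompute and overwrite it.
            changes = compute(old_path, new_path)
            entry = {"fingerprint": fingerprint, "changes": changes}
            self._entries[cid] = entry
        return entry["changes"]
```

A repeated query for the same pair of paths with an unchanged fingerprint returns the stored change metadata without calling `compute` again.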

9 Change Capture Workflow. Command line: dacman diff [options] OLD NEW. The workflow has four stages: scan crawls the data directories and saves directory structures and associated file metadata; index indexes filesystem objects; compare uses the indexes to compare files and saves the change results; diff retrieves the changes.
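
The four stages can be chained as in the Python sketch below. It only mirrors the stage names from the slide; the function signatures, the SHA-1 hash, and the JSON file used to save change results are assumptions made for illustration and do not reflect Dac-Man's actual interfaces or on-disk formats.

```python
import hashlib
import json
import os


def scan(root):
    """Stage 1: crawl a directory, recording relative paths and file sizes."""
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root)
            entries[rel_path] = {"size": os.path.getsize(full_path)}
    return entries


def index(root, entries):
    """Stage 2: map every scanned path to a hash of its contents."""
    hashes = {}
    for rel_path in entries:
        with open(os.path.join(root, rel_path), "rb") as handle:
            hashes[rel_path] = hashlib.sha1(handle.read()).hexdigest()
    return hashes


def compare(old_index, new_index, out_file):
    """Stage 3: compare two indexes and save the change results."""
    old_paths, new_paths = set(old_index), set(new_index)
    changes = {
        "added": sorted(new_paths - old_paths),
        "deleted": sorted(old_paths - new_paths),
        "modified": sorted(
            path for path in old_paths & new_paths
            if old_index[path] != new_index[path]
        ),
    }
    with open(out_file, "w", encoding="utf-8") as handle:
        json.dump(changes, handle)
    return out_file


def diff(change_file):
    """Stage 4: retrieve previously saved changes."""
    with open(change_file, encoding="utf-8") as handle:
        return json.load(handle)
```

Separating compare (which saves results) from diff (which only reads them) is what lets a later query reuse earlier work instead of rescanning both datasets.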

10 Data Provenance in Dac-Man. Metadata captured by Dac-Man contributes to the provenance of a dataset. Dac-Man tracks data provenance in workflows by correlating data changes between inputs and outputs, which enables using provenance information to analyze the impact of changes. [Diagram labels: V1, V2, V3]

11 Evaluation. System: NERSC's Cori supercomputer, 32 cores per node, 128 GB DDR4 memory. Datasets: Sloan Digital Sky Survey (SDSS), which primarily consists of FITS files, in multiple data releases with approx. 9.7 million files; Fluxnet, which consists of CSV files (the approximate file count is missing from the transcription); Synthetic, whose files contain arbitrary binary data, with different numbers of files and amounts of data for controlled experiments. Tools compared: Unix diff, Git diff, Python filecmp.

12 Comparison to Existing Tools. Performs 100x better than existing diff tools.

13 Scalability of Indexing. Indexing time decreases as more resources are allocated.

14 Amount of Data Change. Performance is constant irrespective of the amount of change.

15 Change Retrieval. Caching change information speeds up change retrieval.

16 Conclusions. Dac-Man provides a scalable framework for identifying and capturing changes in large scientific datasets. It generates and shares data change summaries useful for data consumers. It identifies, captures, and tracks changes of different types and granularities. It provides a portable solution for comparing remote datasets. It provides meaningful change results through the use of domain-specific change metrics. It can be integrated with data processing pipelines and streaming datasets for real-time change analysis.

17 Acknowledgments. Deduce Project. U.S. DOE Contract No. DE-AC02-05CH11231. Program Manager: Rich Carlson. Stephen Bailey, Boris Faybishenko, Juliane Mueller, Alex Romosan, Craig Tull.

Seeking Supernovae in the Clouds: A Performance Study

Seeking Supernovae in the Clouds: A Performance Study Seeking Supernovae in the Clouds: A Performance Study Keith R. Jackson, Lavanya Ramakrishnan, Karl J. Runge, Rollin C. Thomas Lawrence Berkeley National Laboratory Why Do I Care About Supernovae? The rate

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information

Performance and Energy Usage of Workloads on KNL and Haswell Architectures

Performance and Energy Usage of Workloads on KNL and Haswell Architectures Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research

More information

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber NERSC Site Update National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory Richard Gerber NERSC Senior Science Advisor High Performance Computing Department Head Cori

More information

Automating Real-time Seismic Analysis

Automating Real-time Seismic Analysis Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows Rafael Ferreira da Silva, Ph.D. http://pegasus.isi.edu Do we need seismic analysis? Pegasus http://pegasus.isi.edu

More information

Tigres: Template Interfaces for Agile Parallel Data-Intensive Science

Tigres: Template Interfaces for Agile Parallel Data-Intensive Science Tigres: Template Interfaces for Agile Parallel Data-Intensive Science http://tigres.lbl.gov Lavanya Ramakrishnan LRamakrishnan@lbl.gov 1 (CS Biased) View of Workflow Challenges: Gene2Life Molecular Biology

More information

Pegasus. Automate, recover, and debug scientific computations. Rafael Ferreira da Silva.

Pegasus. Automate, recover, and debug scientific computations. Rafael Ferreira da Silva. Pegasus Automate, recover, and debug scientific computations. Rafael Ferreira da Silva http://pegasus.isi.edu Experiment Timeline Scientific Problem Earth Science, Astronomy, Neuroinformatics, Bioinformatics,

More information

Workload Characterization using the TAU Performance System

Workload Characterization using the TAU Performance System Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, and Alan Morris Performance Research Laboratory, Department of Computer and Information Science University of

More information

Using a Robust Metadata Management System to Accelerate Scientific Discovery at Extreme Scales

Using a Robust Metadata Management System to Accelerate Scientific Discovery at Extreme Scales Using a Robust Metadata Management System to Accelerate Scientific Discovery at Extreme Scales Margaret Lawson, Jay Lofstead Sandia National Laboratories is a multimission laboratory managed and operated

More information

Managing large-scale workflows with Pegasus

Managing large-scale workflows with Pegasus Funded by the National Science Foundation under the OCI SDCI program, grant #0722019 Managing large-scale workflows with Pegasus Karan Vahi ( vahi@isi.edu) Collaborative Computing Group USC Information

More information

escience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows

escience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows escience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows Jie Li1, Deb Agarwal2, Azure Marty Platform Humphrey1, Keith Jackson2, Catharine van Ingen3, Youngryel Ryu4

More information

SDS: A Framework for Scientific Data Services

SDS: A Framework for Scientific Data Services SDS: A Framework for Scientific Data Services Bin Dong, Suren Byna*, John Wu Scientific Data Management Group Lawrence Berkeley National Laboratory Finding Newspaper Articles of Interest Finding news articles

More information

Case Studies in Storage Access by Loosely Coupled Petascale Applications

Case Studies in Storage Access by Loosely Coupled Petascale Applications Case Studies in Storage Access by Loosely Coupled Petascale Applications Justin M Wozniak and Michael Wilde Petascale Data Storage Workshop at SC 09 Portland, Oregon November 15, 2009 Outline Scripted

More information

The VGG Face Finder (VFF) Engine

The VGG Face Finder (VFF) Engine The VGG Face Finder (VFF) Engine Performs a visual search over a dataset of images with faces Automatically finds images matching your query within the dataset Input can be a text string or an image It

More information

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases

More information

ArrayUDF Explores Structural Locality for Faster Scientific Analyses

ArrayUDF Explores Structural Locality for Faster Scientific Analyses ArrayUDF Explores Structural Locality for Faster Scientific Analyses John Wu 1 Bin Dong 1, Surendra Byna 1, Jialin Liu 1, Weijie Zhao 2, Florin Rusu 1,2 1 LBNL, Berkeley, CA 2 UC Merced, Merced, CA Two

More information

Spark and HPC for High Energy Physics Data Analyses

Spark and HPC for High Energy Physics Data Analyses Spark and HPC for High Energy Physics Data Analyses Marc Paterno, Jim Kowalkowski, and Saba Sehrish 2017 IEEE International Workshop on High-Performance Big Data Computing Introduction High energy physics

More information

RENKU - Reproduce, Reuse, Recycle Research. Rok Roškar and the SDSC Renku team

RENKU - Reproduce, Reuse, Recycle Research. Rok Roškar and the SDSC Renku team RENKU - Reproduce, Reuse, Recycle Research Rok Roškar and the SDSC Renku team Renku-Reana workshop @ CERN 26.06.2018 Goals of Renku 1. Provide the means to create reproducible data science 2. Facilitate

More information

Data storage on Triton: an introduction

Data storage on Triton: an introduction Motivation Data storage on Triton: an introduction How storage is organized in Triton How to optimize IO Do's and Don'ts Exercises slide 1 of 33 Data storage: Motivation Program speed isn t just about

More information

irods for Data Management and Archiving UGM 2018 Masilamani Subramanyam

irods for Data Management and Archiving UGM 2018 Masilamani Subramanyam irods for Data Management and Archiving UGM 2018 Masilamani Subramanyam Agenda Introduction Challenges Data Transfer Solution irods use in Data Transfer Solution irods Proof-of-Concept Q&A Introduction

More information

The Hopper System: How the Largest* XE6 in the World Went From Requirements to Reality! Katie Antypas, Tina Butler, and Jonathan Carter

The Hopper System: How the Largest* XE6 in the World Went From Requirements to Reality! Katie Antypas, Tina Butler, and Jonathan Carter The Hopper System: How the Largest* XE6 in the World Went From Requirements to Reality! Katie Antypas, Tina Butler, and Jonathan Carter CUG 2011, May 25th, 2011 1 Requirements to Reality Develop RFP Select

More information

Scientific Cluster Deployment and Recovery Using puppet to simplify cluster management

Scientific Cluster Deployment and Recovery Using puppet to simplify cluster management Journal of Physics: Conference Series Scientific Cluster Deployment and Recovery Using puppet to simplify cluster management To cite this article: Val Hendrix et al 2012 J. Phys.: Conf. Ser. 396 042027

More information

Engagement With Scientific Facilities

Engagement With Scientific Facilities Engagement With Scientific Facilities Eli Dart, Network Engineer ESnet Science Engagement Lawrence Berkeley National Laboratory Global Science Engagement Panel Internet2 Technology Exchange San Francisco,

More information

Main Points. File layout Directory layout

Main Points. File layout Directory layout File Systems Main Points File layout Directory layout File System Design Constraints For small files: Small blocks for storage efficiency Files used together should be stored together For large files:

More information

Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures

Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Mostofa Patwary 1, Nadathur Satish 1, Narayanan Sundaram 1, Jilalin Liu 2, Peter Sadowski 2, Evan Racah 2, Suren Byna 2, Craig

More information

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster. Overview Both the industry and academia have an increase demand for good policies and mechanisms to

More information

Application and System Memory Use, Configuration, and Problems on Bassi. Richard Gerber

Application and System Memory Use, Configuration, and Problems on Bassi. Richard Gerber Application and System Memory Use, Configuration, and Problems on Bassi Richard Gerber Lawrence Berkeley National Laboratory NERSC User Services ScicomP 13, Garching, Germany, July 17, 2007 NERSC is supported

More information

A Container On a Virtual Machine On an HPC? Presentation to HPC Advisory Council. Perth, July 31-Aug 01, 2017

A Container On a Virtual Machine On an HPC? Presentation to HPC Advisory Council. Perth, July 31-Aug 01, 2017 A Container On a Virtual Machine On an HPC? Presentation to HPC Advisory Council Perth, July 31-Aug 01, 2017 http://levlafayette.com Necessary and Sufficient Definitions High Performance Computing: High

More information

Grid Computing Systems: A Survey and Taxonomy

Grid Computing Systems: A Survey and Taxonomy Grid Computing Systems: A Survey and Taxonomy Material for this lecture from: A Survey and Taxonomy of Resource Management Systems for Grid Computing Systems, K. Krauter, R. Buyya, M. Maheswaran, CS Technical

More information

Strategies for Sound Internet Measurement

Strategies for Sound Internet Measurement Strategies for Sound Internet Measurement Vern Paxson Presented by Hossein Falaki Vern Paxson M.S. and Ph.D. degrees Berkeley Staff scientist at the Lawrence Berkeley National Laboratory Founder of the

More information

Data publication and discovery with Globus

Data publication and discovery with Globus Data publication and discovery with Globus Questions and comments to outreach@globus.org The Globus data publication and discovery services make it easy for institutions and projects to establish collections,

More information

Computational Databases: Inspirations from Statistical Software. Linnea Passing, Technical University of Munich

Computational Databases: Inspirations from Statistical Software. Linnea Passing, Technical University of Munich Computational Databases: Inspirations from Statistical Software Linnea Passing, linnea.passing@tum.de Technical University of Munich Data Science Meets Databases Data Cleansing Pipelines Fuzzy joins Data

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

High Performance Data Analytics for Numerical Simulations. Bruno Raffin DataMove

High Performance Data Analytics for Numerical Simulations. Bruno Raffin DataMove High Performance Data Analytics for Numerical Simulations Bruno Raffin DataMove bruno.raffin@inria.fr April 2016 About this Talk HPC for analyzing the results of large scale parallel numerical simulations

More information

Toward Scalable Monitoring on Large-Scale Storage for Software Defined Cyberinfrastructure

Toward Scalable Monitoring on Large-Scale Storage for Software Defined Cyberinfrastructure Toward Scalable Monitoring on Large-Scale Storage for Software Defined Cyberinfrastructure Arnab K. Paul, Ryan Chard, Kyle Chard, Steven Tuecke, Ali R. Butt, Ian Foster Virginia Tech, Argonne National

More information

A Data Diffusion Approach to Large Scale Scientific Exploration

A Data Diffusion Approach to Large Scale Scientific Exploration A Data Diffusion Approach to Large Scale Scientific Exploration Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago Joint work with: Yong Zhao: Microsoft Ian Foster:

More information

Introduction to Grid Computing

Introduction to Grid Computing Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able

More information

Fast Forward I/O & Storage

Fast Forward I/O & Storage Fast Forward I/O & Storage Eric Barton Lead Architect 1 Department of Energy - Fast Forward Challenge FastForward RFP provided US Government funding for exascale research and development Sponsored by 7

More information

Executing Evaluations over Semantic Technologies using the SEALS Platform

Executing Evaluations over Semantic Technologies using the SEALS Platform Executing Evaluations over Semantic Technologies using the SEALS Platform Miguel Esteban-Gutiérrez, Raúl García-Castro, Asunción Gómez-Pérez Ontology Engineering Group, Departamento de Inteligencia Artificial.

More information

dan.fay@microsoft.com Scientific Data Intensive Computing Workshop 2004 Visualizing and Experiencing E 3 Data + Information: Provide a unique experience to reduce time to insight and knowledge through

More information

A High-Level Distributed Execution Framework for Scientific Workflows

A High-Level Distributed Execution Framework for Scientific Workflows A High-Level Distributed Execution Framework for Scientific Workflows Jianwu Wang 1, Ilkay Altintas 1, Chad Berkley 2, Lucas Gilbert 1, Matthew B. Jones 2 1 San Diego Supercomputer Center, UCSD, U.S.A.

More information

A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System

A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System Ilkay Al(ntas and Daniel Crawl San Diego Supercomputer Center UC San Diego Jianwu Wang UMBC WorDS.sdsc.edu Computa3onal

More information

Kepler and Grid Systems -- Early Efforts --

Kepler and Grid Systems -- Early Efforts -- Distributed Computing in Kepler Lead, Scientific Workflow Automation Technologies Laboratory San Diego Supercomputer Center, (Joint work with Matthew Jones) 6th Biennial Ptolemy Miniconference Berkeley,

More information

Extreme I/O Scaling with HDF5

Extreme I/O Scaling with HDF5 Extreme I/O Scaling with HDF5 Quincey Koziol Director of Core Software Development and HPC The HDF Group koziol@hdfgroup.org July 15, 2012 XSEDE 12 - Extreme Scaling Workshop 1 Outline Brief overview of

More information

Scalable, Automated Characterization of Parallel Application Communication Behavior

Scalable, Automated Characterization of Parallel Application Communication Behavior Scalable, Automated Characterization of Parallel Application Communication Behavior Philip C. Roth Computer Science and Mathematics Division Oak Ridge National Laboratory 12 th Scalable Tools Workshop

More information

Enabling a SuperFacility with Software Defined Networking

Enabling a SuperFacility with Software Defined Networking Enabling a SuperFacility with Software Defined Networking Shane Canon Tina Declerck, Brent Draney, Jason Lee, David Paul, David Skinner May 2017 CUG 2017-1 - SuperFacility - Defined Combining the capabilities

More information

Introduction to Geodatabase and Spatial Management in ArcGIS. Craig Gillgrass Esri

Introduction to Geodatabase and Spatial Management in ArcGIS. Craig Gillgrass Esri Introduction to Geodatabase and Spatial Management in ArcGIS Craig Gillgrass Esri Session Path The Geodatabase - What is it? - Why use it? - What types are there? - What can I do with it? Query Layers

More information

Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets

Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets Page 1 of 5 1 Year 1 Proposal Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets Year 1 Progress Report & Year 2 Proposal In order to setup the context for this progress

More information

Making Supercomputing More Available and Accessible Windows HPC Server 2008 R2 Beta 2 Microsoft High Performance Computing April, 2010

Making Supercomputing More Available and Accessible Windows HPC Server 2008 R2 Beta 2 Microsoft High Performance Computing April, 2010 Making Supercomputing More Available and Accessible Windows HPC Server 2008 R2 Beta 2 Microsoft High Performance Computing April, 2010 Windows HPC Server 2008 R2 Windows HPC Server 2008 R2 makes supercomputing

More information

File System Interface and Implementation

File System Interface and Implementation Unit 8 Structure 8.1 Introduction Objectives 8.2 Concept of a File Attributes of a File Operations on Files Types of Files Structure of File 8.3 File Access Methods Sequential Access Direct Access Indexed

More information

Introduction to High Performance Parallel I/O

Introduction to High Performance Parallel I/O Introduction to High Performance Parallel I/O Richard Gerber Deputy Group Lead NERSC User Services August 30, 2013-1- Some slides from Katie Antypas I/O Needs Getting Bigger All the Time I/O needs growing

More information

Introduction to The Storage Resource Broker

Introduction to The Storage Resource Broker http://www.nesc.ac.uk/training http://www.ngs.ac.uk Introduction to The Storage Resource Broker http://www.pparc.ac.uk/ http://www.eu-egee.org/ Policy for re-use This presentation can be re-used for academic

More information

Distributed Memory Parallel Markov Random Fields Using Graph Partitioning

Distributed Memory Parallel Markov Random Fields Using Graph Partitioning Distributed Memory Parallel Markov Random Fields Using Graph Partitioning Colleen Heinemann, Talita Perciano, Daniela Ushizima, Wes Bethel December 11, 2017 Overview What is MRF-based image segmentation?

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

Performance Analysis of Parallel Scientific Applications In Eclipse

Performance Analysis of Parallel Scientific Applications In Eclipse Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains

More information

Overview. Scientific workflows and Grids. Kepler revisited Data Grids. Taxonomy Example systems. Chimera GridDB

Overview. Scientific workflows and Grids. Kepler revisited Data Grids. Taxonomy Example systems. Chimera GridDB Grids and Workflows Overview Scientific workflows and Grids Taxonomy Example systems Kepler revisited Data Grids Chimera GridDB 2 Workflows and Grids Given a set of workflow tasks and a set of resources,

More information

Enosis: Bridging the Semantic Gap between

Enosis: Bridging the Semantic Gap between Enosis: Bridging the Semantic Gap between File-based and Object-based Data Models Anthony Kougkas - akougkas@hawk.iit.edu, Hariharan Devarajan, Xian-He Sun Outline Introduction Background Approach Evaluation

More information

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT. Chapter 4:- Introduction to Grid and its Evolution Prepared By:- Assistant Professor SVBIT. Overview Background: What is the Grid? Related technologies Grid applications Communities Grid Tools Case Studies

More information

Commercial Data Intensive Cloud Computing Architecture: A Decision Support Framework

Commercial Data Intensive Cloud Computing Architecture: A Decision Support Framework Association for Information Systems AIS Electronic Library (AISeL) CONF-IRM 2014 Proceedings International Conference on Information Resources Management (CONF-IRM) 2014 Commercial Data Intensive Cloud

More information

Pegasus. Pegasus Workflow Management System. Mats Rynge

Pegasus. Pegasus Workflow Management System. Mats Rynge Pegasus Pegasus Workflow Management System Mats Rynge rynge@isi.edu https://pegasus.isi.edu Automate Why workflows? Recover Automates complex, multi-stage processing pipelines Enables parallel, distributed

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Compilers and Compiler-based Tools for HPC

Compilers and Compiler-based Tools for HPC Compilers and Compiler-based Tools for HPC John Mellor-Crummey Department of Computer Science Rice University http://lacsi.rice.edu/review/2004/slides/compilers-tools.pdf High Performance Computing Algorithms

More information

Shifter: Fast and consistent HPC workflows using containers

Shifter: Fast and consistent HPC workflows using containers Shifter: Fast and consistent HPC workflows using containers CUG 2017, Redmond, Washington Lucas Benedicic, Felipe A. Cruz, Thomas C. Schulthess - CSCS May 11, 2017 Outline 1. Overview 2. Docker 3. Shifter

More information

HPC on Sun Today and Tomorrow VIRACOCHA: An Efficient Parallelization Framework Processing in Virtual Environments. Andreas Gerndt

HPC on Sun Today and Tomorrow VIRACOCHA: An Efficient Parallelization Framework Processing in Virtual Environments. Andreas Gerndt HPC on Sun Today and Tomorrow VIRACOCHA: An Efficient Parallelization Framework for Large-Scale CFD Post-Processing Processing in Virtual Environments Andreas Gerndt Aachen University (RWTH), Germany Center

More information

Online Monitoring of I/O

Online Monitoring of I/O Introduction On-line Monitoring Framework Evaluation Summary References Research Group German Climate Computing Center 23-03-2017 Introduction On-line Monitoring Framework Evaluation Summary References

More information

ICAT Job Portal. a generic job submission system built on a scientific data catalog. IWSG 2013 ETH, Zurich, Switzerland 3-5 June 2013

ICAT Job Portal. a generic job submission system built on a scientific data catalog. IWSG 2013 ETH, Zurich, Switzerland 3-5 June 2013 ICAT Job Portal a generic job submission system built on a scientific data catalog IWSG 2013 ETH, Zurich, Switzerland 3-5 June 2013 Steve Fisher, Kevin Phipps and Dan Rolfe Rutherford Appleton Laboratory

More information

The State and Needs of IO Performance Tools

The State and Needs of IO Performance Tools The State and Needs of IO Performance Tools Scalable Tools Workshop Lake Tahoe, CA August 6 12, 2017 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National

More information

Taming Metadata Storms in Parallel Filesystems with MetaFS. Tim Shaffer

Taming Metadata Storms in Parallel Filesystems with MetaFS. Tim Shaffer Taming Metadata Storms in Parallel Filesystems with MetaFS Tim Shaffer Motivation A (well-meaning) user tried to run a bioinformatics pipeline to analyze a batch of genomic data. 2 Motivation Shared filesystem

More information

GraphTrek: Asynchronous Graph Traversal for Property Graph-Based Metadata Management

GraphTrek: Asynchronous Graph Traversal for Property Graph-Based Metadata Management GraphTrek: Asynchronous Graph Traversal for Property Graph-Based Metadata Management Dong Dai, Philip Carns, Robert B. Ross, John Jenkins, Kyle Blauer, and Yong Chen Metadata Management Challenges in HPC

More information

The Why and How of HPC-Cloud Hybrids with OpenStack

The Why and How of HPC-Cloud Hybrids with OpenStack The Why and How of HPC-Cloud Hybrids with OpenStack OpenStack Australia Day Melbourne June, 2017 Lev Lafayette, HPC Support and Training Officer, University of Melbourne lev.lafayette@unimelb.edu.au 1.0

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Parallel In-situ Data Processing Techniques

Parallel In-situ Data Processing Techniques Parallel In-situ Data Processing Techniques Florin Rusu, Yu Cheng University of California, Merced Outline Background SCANRAW Operator Speculative Loading Evaluation Astronomy: FITS Format Sloan Digital

More information

Data Intensive processing with irods and the middleware CiGri for the Whisper project Xavier Briand

Data Intensive processing with irods and the middleware CiGri for the Whisper project Xavier Briand and the middleware CiGri for the Whisper project Use Case of Data-Intensive processing with irods Collaboration between: IT part of Whisper: Sofware development, computation () Platform Ciment: IT infrastructure

More information

INTRODUCTION TO DATA MINING

INTRODUCTION TO DATA MINING INTRODUCTION TO DATA MINING 1 Chiara Renso KDDLab - ISTI CNR, Italy http://www-kdd.isti.cnr.it email: chiara.renso@isti.cnr.it Knowledge Discovery and Data Mining Laboratory, ISTI National Research Council,

More information
