SDS: A Framework for Scientific Data Services


1 SDS: A Framework for Scientific Data Services
Bin Dong, Suren Byna*, John Wu
Scientific Data Management Group, Lawrence Berkeley National Laboratory

2 Finding Newspaper Articles of Interest
Finding news articles of interest in the old era of print newspapers:
- Turn pages
- Read headlines
- Searching for specific news was time consuming
Online newspapers:
- Search box
- Limited to one newspaper
Customized news:
- Articles are organized based on the reader's interest
- Search numerous online newspapers
-> Better organization, faster access to news
11/18/2013, Computational Research Division, Lawrence Berkeley National Laboratory, Department of Energy

6 Scientific Discovery Needs Fast Data Analysis
Modern scientific discoveries are driven by massive data:
- Scientific simulations and experiments in many domains produce data in the range of terabytes and petabytes
- Examples: LSST image simulator, LHC Higgs boson search, light source experiments at LCLS, ALS, SNS, etc., NCAR's CESM data
Fast data analysis is an important requirement for discovery:
- Data is typically stored as files on disks that are managed by file systems
- High data read performance from storage is critical

7 A Generic Data Analysis Use Case
Simulation site: exascale simulation machine (+ analysis) with parallel storage and an archive; perform some data analysis on the exascale machine (e.g., in situ pattern identification).
Experiment/observation site: experiment/observation processing machine with (parallel) storage and an archive.
Analysis sites: analysis machines with shared storage; reduce and prepare data for further exploratory analysis (e.g., data mining).
Need to reduce EBs and PBs of data, and move only TBs from simulation sites.

8 Current Practice: Immutable Data in File Systems
Data stored on file systems is immutable:
- After data producers (simulations and experiments) store data, the file format/organization is unchanged over time
- Data is stored in files, which the system treats as a sequence of bytes
- User code (and libraries) has to impose meaning on the file layout and conduct analyses
Burden of optimizing read performance falls on the application developer:
- However, optimizations are complex due to the several layers of the parallel I/O stack
Parallel I/O software stack: Applications, high-level I/O libraries and formats, I/O optimization layers, parallel file system, storage hardware

9 Changing the Layout of Data on File Systems Is Helpful
Separation of logical and physical data organization through reorganization.
Example: access all variables where 1.15 < Energy < 1.4 (columns: Energy, X, Y, Z)
- A full scan of the data, or random accesses to non-contiguous data locations, results in poor performance
- Sorting the data and accessing contiguous chunks of data leads to good performance
Several studies show that reorganization can help; some are listed in the related work section.
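The benefit of sorting can be seen in a small, self-contained example (plain Python, not SDS code): over a sorted column, a range query touches one contiguous slice located by two binary searches, instead of examining every record.

```python
import bisect
import random

random.seed(42)
energy = [random.uniform(1.0, 2.0) for _ in range(100_000)]

# Full scan: every record is examined.
full_scan = [e for e in energy if 1.15 < e < 1.4]

# Reorganized layout: sort once, then answer the range query with
# two binary searches and one contiguous slice.
sorted_energy = sorted(energy)
lo = bisect.bisect_right(sorted_energy, 1.15)  # first element > 1.15
hi = bisect.bisect_left(sorted_energy, 1.4)    # first element >= 1.4
contiguous = sorted_energy[lo:hi]

assert sorted(full_scan) == contiguous  # same answer, far fewer touches
```

The same idea scales to on-disk data: the contiguous slice becomes one large sequential read instead of a scan or many random seeks.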

12 Database Systems for Scientific Data Management
Database management systems perform various optimizations without user involvement.
Many efforts from the database community reach out to manage scientific data: Objectivity/DB, ArrayQL, SciQL, SciDB, etc.
SciDB, a database system for scientific applications. Key features:
- Array-oriented data model
- Append-only storage
- First-class support for user-defined functions
- Massively parallel computations

13 More DB Optimizations, but...
Database systems use various optimization tricks and perform them automatically:
- Cache frequently accessed data
- Use indexes
- Store materialized views
However, preparing and loading scientific data to fit into database schemas is cumbersome.
Scientific data management requirements differ from those of a DBMS:
- Established data formats (HDF5, NetCDF, ROOT, FITS, ADIOS-BP, etc.)
- Arrays are first-class citizens
- Highly concurrent applications

14 Our Solution: Scientific Data Services (SDS)
Preserve scientific data formats and add database optimizations as services: a framework for bringing the merits of database technologies to scientific data.
Examples of services:
- Indexing
- Sorting
- Reorganization to improve data locality
- Compression
- Remote and selective data movement

15 First Step: Automatic Data Reorganization
Finding an optimal data organization strategy:
- Capture common read patterns
- Estimate the cost of various data organization strategies
- Rank the data organizations for each pattern and select the best one
Performing data reorganization:
- Similar to database management systems, perform data reorganization transparently
Resolving permissions to the original data:
- Preserve the original permissions when data is reorganized
Using an optimally reorganized dataset transparently:
- Based on a read pattern, recognize the best available organization of the data
- Redirect reads to the selected organization
- Perform any needed post-processing, such as decompression or transposition
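The ranking step above can be sketched as follows (illustrative Python with a made-up cost model; the organization names and cost values are assumptions for illustration, not the SDS implementation):

```python
# Hypothetical cost model: estimated read cost of a candidate
# organization for a given access pattern, in arbitrary units.
def estimate_cost(organization: str, pattern: str) -> float:
    costs = {
        ("row-major", "range-query"): 100.0,   # full scan
        ("sorted", "range-query"): 4.0,        # contiguous slice
        ("transposed", "range-query"): 60.0,
        ("row-major", "full-read"): 10.0,
        ("sorted", "full-read"): 10.0,
        ("transposed", "full-read"): 12.0,
    }
    return costs[(organization, pattern)]

def best_organization(pattern: str, candidates: list) -> str:
    # Rank candidates by estimated cost and pick the cheapest.
    return min(candidates, key=lambda org: estimate_cost(org, pattern))

candidates = ["row-major", "sorted", "transposed"]
print(best_organization("range-query", candidates))  # sorted
print(best_organization("full-read", candidates))    # row-major
```

In a real system the cost estimates would come from the captured read patterns and measured I/O characteristics rather than a fixed table.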

16 The Design of the SDS Framework
Initial implementation of the SDS framework:
- Client-server approach
- SDS Client library for each MPI process
- SDS Server to find the best dataset among the available reorganized datasets
- Supports the HDF5 read API
- Added an SDS Query API for answering range queries
[Architecture diagram: each MPI process runs an application using the HDF5 API and SDS Query API on top of the SDS Client; clients talk over the network to the SDS Server, which uses the parallel file system, a database, and the batch system.]

17 SDS Client Implementation: HDF5 API
- The HDF5 Virtual Object Layer (VOL) feature allows capturing HDF5 calls
- Developed a VOL plugin for SDS that captures file open, read, and close functions
[Client diagram: the application issues HDF5 API calls or SDS queries; the Parser builds a final query string, the Server Connector sends it to the SDS Server, the Data Reader reads the original or reorganized file from the parallel file system using its metadata, and the Post Processor returns the requested data.]

18 SDS Client Implementation: SDS Query Interface
- An interface to perform SQL-style queries on arrays
- Function-based API that can be used from C/C++ applications

19 SDS Client Implementation: Parser
- Checks the conditions in a query
- Verifies the validity of files
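A minimal sketch of what checking a query condition might look like (plain Python with a regular expression; the condition grammar shown is an assumption for illustration, not the SDS parser):

```python
import re

# Hypothetical condition grammar: "<name> <op> <number>", e.g. "Energy > 1.15".
_COND = re.compile(r"^\s*(\w+)\s*(<=|>=|==|<|>)\s*([0-9.]+)\s*$")

def parse_condition(text: str):
    """Validate a single query condition and split it into parts."""
    m = _COND.match(text)
    if m is None:
        raise ValueError("malformed condition: %r" % text)
    name, op, value = m.groups()
    return name, op, float(value)

print(parse_condition("Energy > 1.15"))  # ('Energy', '>', 1.15)
```

A malformed condition is rejected before anything is sent to the server, keeping the error local to the client.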

20 SDS Client Implementation: Server Connector
- Packages a query or HDF5 read-call information and sends it to the SDS Server
- Uses Protocol Buffers for communication
- MPI rank 0 communicates with the server and then informs the remaining MPI processes
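The request packaging can be sketched as a serialize/deserialize round trip (Python, using json in place of Protocol Buffers purely to keep the sketch dependency-free; the field names and paths are illustrative assumptions):

```python
import json

def pack_request(file_handle: str, dataset: str, query: str) -> bytes:
    # In SDS this payload would be a Protocol Buffers message sent by
    # MPI rank 0; json stands in here to keep the sketch self-contained.
    return json.dumps(
        {"file": file_handle, "dataset": dataset, "query": query}
    ).encode("utf-8")

def unpack_request(payload: bytes) -> dict:
    return json.loads(payload.decode("utf-8"))

payload = pack_request("/data/vpic.h5", "/particles/energy", "Energy > 1.15")
request = unpack_request(payload)
print(request["query"])  # Energy > 1.15
```

Having only rank 0 talk to the server, then broadcasting the response to the other ranks, keeps the server load independent of the number of MPI processes.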

21 SDS Client Implementation: Data Reader
- Reads data from the dataset location returned by the SDS Server
- Makes use of the native HDF5 read calls

22 SDS Client Implementation: Post Processor
- Performs any post-processing needed before returning the data to the application's memory
- E.g., decompression, transposition, etc.

23 SDS Server Implementation
[Server diagram: the Request Dispatcher receives client requests, user reorganization requests, and periodic self-start requests, and routes them to the Query Evaluator or the Reorganization Evaluator; the Read Statistics Manager tracks frequently read file handles and their SDS metadata; the Data Organization Recommender proposes an organization list; the Data Reorganizer submits and monitors a batch reorganization job that writes the reorganized file to the parallel file system; the SDS Admin interface issues commands.]

24 SDS Server Implementation: Request Dispatcher
- Receives requests from SDS Clients and from the SDS Admin interface
- The SDS Admin interface issues reorganization commands
- Based on the request, the dispatcher passes it on to the Query Evaluator or the Reorganization Evaluator

25 SDS Server Implementation: Query Evaluator
- Looks up the SDS Metadata to find the available reorganized datasets, and their locations, for a given dataset
- SDS Metadata: file name, HDF5 dataset info, permissions
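The metadata lookup can be sketched with an in-memory mapping (illustrative Python; the real server keeps this metadata in Berkeley DB, and the keys, paths, and fields below are assumptions):

```python
# Illustrative stand-in for the SDS Metadata store, keyed by
# (file handle, dataset name); each entry lists reorganized variants.
sds_metadata = {
    ("/data/vpic.h5", "/particles/energy"): [
        {"organization": "sorted",
         "path": "/data/vpic_sorted.h5",
         "permissions": 0o640},
    ],
}

def lookup_reorganized(file_handle: str, dataset: str) -> list:
    """Return the reorganized variants known for a dataset, if any."""
    return sds_metadata.get((file_handle, dataset), [])

variants = lookup_reorganized("/data/vpic.h5", "/particles/energy")
print(variants[0]["path"])  # /data/vpic_sorted.h5
```

A miss (an empty list) means the client falls back to reading the original file.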

26 SDS Server Implementation: Reorganization Evaluator
- Decides whether to reorganize, based on the frequency of read accesses
- Takes commands from the Admin interface
- Instructs the Data Reorganizer to create a reorganization job script

27 SDS Server Implementation: Data Organization Recommender
- Identifies the optimal data reorganization
- Informs the Reorganization Evaluator of the selected strategy

28 SDS Server Implementation: Data Reorganizer
- Locates the reorganization code, such as sorting and indexing algorithms
- Decides on the number of cores to use
- Prepares a batch job script
- Monitors the job execution
- After reorganization, stores the new data location in the SDS Metadata Manager
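Generating such a batch job script can be sketched as simple templating (Python; the PBS-style directives, tool name, and paths are illustrative assumptions, not the scripts SDS emits):

```python
def make_reorg_job_script(tool: str, input_path: str, output_path: str,
                          cores: int) -> str:
    # Hypothetical PBS-style batch script for a reorganization job.
    return "\n".join([
        "#!/bin/bash",
        "#PBS -q regular",
        "#PBS -l mppwidth=%d" % cores,
        "aprun -n %d %s %s %s" % (cores, tool, input_path, output_path),
        "",
    ])

script = make_reorg_job_script("sds_sort", "/data/vpic.h5",
                               "/data/vpic_sorted.h5", 240)
print(script)
```

The server would submit this script to the batch system, poll the job, and record the reorganized file's location once it completes.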

29 Performance Evaluation: Setup
NERSC Hopper supercomputer:
- 6,384 compute nodes
- Two twelve-core AMD 'Magny-Cours' 2.1 GHz processors per node
- 32 GB DDR3 memory per node
- Gemini interconnect with a 3D torus topology
- Lustre parallel file system with 156 OSTs at a peak bandwidth of 35 GB/s
Software:
- Cray's MPI library
- HDF5 Virtual Object Layer code base
Data:
- Particle data from a plasma physics simulation (VPIC) of the magnetic reconnection phenomenon in space weather

30 Performance Evaluation: Overhead of SDS
- The SDS Metadata Manager uses Berkeley DB for storing the metadata of reorganized data
- Measured the response time of multiple concurrent SDS Clients requesting metadata from one SDS Server
- Low overhead: < 0.5 seconds with 240 concurrent clients
[Chart: time in seconds for HDF5 open+close vs. SDS metadata read, for 40 to 320 concurrent clients.]

31 Performance Evaluation: Benefits of SDS
- Ran various queries with conditions on particle energy
- The performance benefit over a full scan varies with the selectivity of the data
- 20x to 50x speedup when data selectivity is less than 5%

Query     Selectivity  Speedup (reorganized read vs. full scan)
E > 1.10  79%          1.2x
E > 1.15  32%          2.3x
E > 1.20  11%          8.1x
E > 1.25  4.6%         18.9x
E > 1.30  2.2%         25.7x
E > 1.35  1.2%         36.3x
E > 1.40  0.75%        42.0x
E > 1.45  0.5%         44.8x
E > 1.50  0.33%        51.2x

32 Conclusion and Future Work
- File systems on current supercomputers are analogous to print newspapers
- The SDS framework is a vehicle for applying various optimizations: live customization of data organization based on data usage (in progress), while working with existing scientific data formats
Status:
- The platform for performing various optimizations is ready
- Low overhead and high performance benefits
Ongoing and future work:
- Dynamic recognition of read patterns
- Model-driven data organization optimizations
- In-memory query execution and query plan optimization
- FastBit indexing to support querying
- Support for more data formats

33 Thanks!
Contact: Suren Byna
Thanks to:


More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

Oak Ridge National Laboratory Computing and Computational Sciences

Oak Ridge National Laboratory Computing and Computational Sciences Oak Ridge National Laboratory Computing and Computational Sciences OFA Update by ORNL Presented by: Pavel Shamis (Pasha) OFA Workshop Mar 17, 2015 Acknowledgments Bernholdt David E. Hill Jason J. Leverman

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers Overlapping Computation and Communication for Advection on Hybrid Parallel Computers James B White III (Trey) trey@ucar.edu National Center for Atmospheric Research Jack Dongarra dongarra@eecs.utk.edu

More information

API and Usage of libhio on XC-40 Systems

API and Usage of libhio on XC-40 Systems API and Usage of libhio on XC-40 Systems May 24, 2018 Nathan Hjelm Cray Users Group May 24, 2018 Los Alamos National Laboratory LA-UR-18-24513 5/24/2018 1 Outline Background HIO Design HIO API HIO Configuration

More information

High Performance Data Analytics for Numerical Simulations. Bruno Raffin DataMove

High Performance Data Analytics for Numerical Simulations. Bruno Raffin DataMove High Performance Data Analytics for Numerical Simulations Bruno Raffin DataMove bruno.raffin@inria.fr April 2016 About this Talk HPC for analyzing the results of large scale parallel numerical simulations

More information

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0)

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

Dac-Man: Data Change Management for Scientific Datasets on HPC systems

Dac-Man: Data Change Management for Scientific Datasets on HPC systems Dac-Man: Data Change Management for Scientific Datasets on HPC systems Devarshi Ghoshal Lavanya Ramakrishnan Deborah Agarwal Lawrence Berkeley National Laboratory Email: dghoshal@lbl.gov Motivation Data

More information

Introduction to Parallel I/O

Introduction to Parallel I/O Introduction to Parallel I/O Bilel Hadri bhadri@utk.edu NICS Scientific Computing Group OLCF/NICS Fall Training October 19 th, 2011 Outline Introduction to I/O Path from Application to File System Common

More information

Storage Hierarchy Management for Scientific Computing

Storage Hierarchy Management for Scientific Computing Storage Hierarchy Management for Scientific Computing by Ethan Leo Miller Sc. B. (Brown University) 1987 M.S. (University of California at Berkeley) 1990 A dissertation submitted in partial satisfaction

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan & Matt Larsen (University of Oregon), Hank Childs (Lawrence Berkeley National Laboratory) 26

More information

Applying DDN to Machine Learning

Applying DDN to Machine Learning Applying DDN to Machine Learning Jean-Thomas Acquaviva jacquaviva@ddn.com Learning from What? Multivariate data Image data Facial recognition Action recognition Object detection and recognition Handwriting

More information

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera

More information

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat H5Spark H5Spark: Bridging the I/O Gap between Spark and Scien9fic Data Formats on HPC Systems Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike

More information

DATABASE MANAGEMENT SYSTEM ARCHITECTURE

DATABASE MANAGEMENT SYSTEM ARCHITECTURE DATABASE 1 MANAGEMENT SYSTEM ARCHITECTURE DBMS ARCHITECTURE 2 The logical DBMS architecture The physical DBMS architecture DBMS ARCHITECTURE 3 The logical DBMS architecture The logical architecture deals

More information

A Plugin for HDF5 using PLFS for Improved I/O Performance and Semantic Analysis

A Plugin for HDF5 using PLFS for Improved I/O Performance and Semantic Analysis 2012 SC Companion: High Performance Computing, Networking Storage and Analysis A for HDF5 using PLFS for Improved I/O Performance and Semantic Analysis Kshitij Mehta, John Bent, Aaron Torres, Gary Grider,

More information

Approaching the Petabyte Analytic Database: What I learned

Approaching the Petabyte Analytic Database: What I learned Disclaimer This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may

More information

The Red Storm System: Architecture, System Update and Performance Analysis

The Red Storm System: Architecture, System Update and Performance Analysis The Red Storm System: Architecture, System Update and Performance Analysis Douglas Doerfler, Jim Tomkins Sandia National Laboratories Center for Computation, Computers, Information and Mathematics LACSI

More information

Lustre architecture for Riccardo Veraldi for the LCLS IT Team

Lustre architecture for Riccardo Veraldi for the LCLS IT Team Lustre architecture for LCLS@SLAC Riccardo Veraldi for the LCLS IT Team 2 LCLS Experimental Floor 3 LCLS Parameters 4 LCLS Physics LCLS has already had a significant impact on many areas of science, including:

More information

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement

More information

UK LUG 10 th July Lustre at Exascale. Eric Barton. CTO Whamcloud, Inc Whamcloud, Inc.

UK LUG 10 th July Lustre at Exascale. Eric Barton. CTO Whamcloud, Inc Whamcloud, Inc. UK LUG 10 th July 2012 Lustre at Exascale Eric Barton CTO Whamcloud, Inc. eeb@whamcloud.com Agenda Exascale I/O requirements Exascale I/O model 3 Lustre at Exascale - UK LUG 10th July 2012 Exascale I/O

More information

Do You Know What Your I/O Is Doing? (and how to fix it?) William Gropp

Do You Know What Your I/O Is Doing? (and how to fix it?) William Gropp Do You Know What Your I/O Is Doing? (and how to fix it?) William Gropp www.cs.illinois.edu/~wgropp Messages Current I/O performance is often appallingly poor Even relative to what current systems can achieve

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Science 2.0 VU Big Science, e-science and E- Infrastructures + Bibliometric Network Analysis

Science 2.0 VU Big Science, e-science and E- Infrastructures + Bibliometric Network Analysis W I S S E N n T E C H N I K n L E I D E N S C H A F T Science 2.0 VU Big Science, e-science and E- Infrastructures + Bibliometric Network Analysis Elisabeth Lex KTI, TU Graz WS 2015/16 u www.tugraz.at

More information

Lustre Parallel Filesystem Best Practices

Lustre Parallel Filesystem Best Practices Lustre Parallel Filesystem Best Practices George Markomanolis Computational Scientist KAUST Supercomputing Laboratory georgios.markomanolis@kaust.edu.sa 7 November 2017 Outline Introduction to Parallel

More information

Welcome! Virtual tutorial starts at 15:00 BST

Welcome! Virtual tutorial starts at 15:00 BST Welcome! Virtual tutorial starts at 15:00 BST Parallel IO and the ARCHER Filesystem ARCHER Virtual Tutorial, Wed 8 th Oct 2014 David Henty Reusing this material This work is licensed

More information

Parallel I/O from a User s Perspective

Parallel I/O from a User s Perspective Parallel I/O from a User s Perspective HPC Advisory Council Stanford University, Dec. 6, 2011 Katie Antypas Group Leader, NERSC User Services NERSC is DOE in HPC Production Computing Facility NERSC computing

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Crossing the Chasm: Sneaking a parallel file system into Hadoop Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

Execution Models for the Exascale Era

Execution Models for the Exascale Era Execution Models for the Exascale Era Nicholas J. Wright Advanced Technology Group, NERSC/LBNL njwright@lbl.gov Programming weather, climate, and earth- system models on heterogeneous muli- core plajorms

More information

Leveraging Flash in HPC Systems

Leveraging Flash in HPC Systems Leveraging Flash in HPC Systems IEEE MSST June 3, 2015 This work was performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344. Lawrence Livermore National Security,

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Structuring PLFS for Extensibility

Structuring PLFS for Extensibility Structuring PLFS for Extensibility Chuck Cranor, Milo Polte, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University What is PLFS? Parallel Log Structured File System Interposed filesystem b/w

More information

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Crossing the Chasm: Sneaking a parallel file system into Hadoop Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large

More information

2014 年 3 月 13 日星期四. From Big Data to Big Value Infrastructure Needs and Huawei Best Practice

2014 年 3 月 13 日星期四. From Big Data to Big Value Infrastructure Needs and Huawei Best Practice 2014 年 3 月 13 日星期四 From Big Data to Big Value Infrastructure Needs and Huawei Best Practice Data-driven insight Making better, more informed decisions, faster Raw Data Capture Store Process Insight 1 Data

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Parser. Select R.text from Report R, Weather W where W.image.rain() and W.city = R.city and W.date = R.date and R.text.

Parser. Select R.text from Report R, Weather W where W.image.rain() and W.city = R.city and W.date = R.date and R.text. Select R.text from Report R, Weather W where W.image.rain() and W.city = R.city and W.date = R.date and R.text. Lifecycle of an SQL Query CSE 190D base System Implementation Arun Kumar Query Query Result

More information

Dr. John Dennis

Dr. John Dennis Dr. John Dennis dennis@ucar.edu June 23, 2011 1 High-resolution climate generates a large amount of data! June 23, 2011 2 PIO update and Lustre optimizations How do we analyze high-resolution climate data

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

UCX: An Open Source Framework for HPC Network APIs and Beyond

UCX: An Open Source Framework for HPC Network APIs and Beyond UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation

More information

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work

Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work Today (2014):

More information

RobinHood Project Status

RobinHood Project Status FROM RESEARCH TO INDUSTRY RobinHood Project Status Robinhood User Group 2015 Thomas Leibovici 9/18/15 SEPTEMBER, 21 st 2015 Project history... 1999: simple purge tool for HPC

More information

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT. Chapter 4:- Introduction to Grid and its Evolution Prepared By:- Assistant Professor SVBIT. Overview Background: What is the Grid? Related technologies Grid applications Communities Grid Tools Case Studies

More information

CSE 190D Database System Implementation

CSE 190D Database System Implementation CSE 190D Database System Implementation Arun Kumar Topic 1: Data Storage, Buffer Management, and File Organization Chapters 8 and 9 (except 8.5.4 and 9.2) of Cow Book Slide ACKs: Jignesh Patel, Paris Koutris

More information

Using the In-Memory Columnar Store to Perform Real-Time Analysis of CERN Data. Maaike Limper Emil Pilecki Manuel Martín Márquez

Using the In-Memory Columnar Store to Perform Real-Time Analysis of CERN Data. Maaike Limper Emil Pilecki Manuel Martín Márquez Using the In-Memory Columnar Store to Perform Real-Time Analysis of CERN Data Maaike Limper Emil Pilecki Manuel Martín Márquez About the speakers Maaike Limper Physicist and project leader Manuel Martín

More information

A Note on Interfacing Object Warehouses and Mass Storage Systems for Data Mining Applications *

A Note on Interfacing Object Warehouses and Mass Storage Systems for Data Mining Applications * A Note on Interfacing Object Warehouses and Mass Storage Systems for Data Mining Applications * Robert L. Grossman Magnify, Inc. University of Illinois at Chicago 815 Garfield Street Laboratory for Advanced

More information

Realtime Data Analytics at NERSC

Realtime Data Analytics at NERSC Realtime Data Analytics at NERSC Prabhat XLDB May 24, 2016-1 - Lawrence Berkeley National Laboratory - 2 - National Energy Research Scientific Computing Center 3 NERSC is the Production HPC & Data Facility

More information

In Situ Generated Probability Distribution Functions for Interactive Post Hoc Visualization and Analysis

In Situ Generated Probability Distribution Functions for Interactive Post Hoc Visualization and Analysis In Situ Generated Probability Distribution Functions for Interactive Post Hoc Visualization and Analysis Yucong (Chris) Ye 1, Tyson Neuroth 1, Franz Sauer 1, Kwan-Liu Ma 1, Giulio Borghesi 2, Aditya Konduri

More information

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture By Gaurav Sheoran 9-Dec-08 Abstract Most of the current enterprise data-warehouses

More information

End-to-End Data Integrity in the Intel/EMC/HDF Group Exascale IO DOE Fast Forward Project

End-to-End Data Integrity in the Intel/EMC/HDF Group Exascale IO DOE Fast Forward Project End-to-End Data Integrity in the Intel/EMC/HDF Group Exascale IO DOE Fast Forward Project As presented by John Bent, EMC and Quincey Koziol, The HDF Group Truly End-to-End App provides checksum buffer

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24 FILE SYSTEMS, PART 2 CS124 Operating Systems Fall 2017-2018, Lecture 24 2 Last Time: File Systems Introduced the concept of file systems Explored several ways of managing the contents of files Contiguous

More information

Jingren Zhou. Microsoft Corp.

Jingren Zhou. Microsoft Corp. Jingren Zhou Microsoft Corp. Microsoft Bing Infrastructure BING applications fall into two broad categories: Back-end: Massive batch processing creates new datasets Front-end: Online request processing

More information

HPC Storage Use Cases & Future Trends

HPC Storage Use Cases & Future Trends Oct, 2014 HPC Storage Use Cases & Future Trends Massively-Scalable Platforms and Solutions Engineered for the Big Data and Cloud Era Atul Vidwansa Email: atul@ DDN About Us DDN is a Leader in Massively

More information

CIS Operating Systems File Systems. Professor Qiang Zeng Fall 2017

CIS Operating Systems File Systems. Professor Qiang Zeng Fall 2017 CIS 5512 - Operating Systems File Systems Professor Qiang Zeng Fall 2017 Previous class I/O subsystem: hardware aspect Terms: controller, bus, port Addressing: port-mapped IO and memory-mapped IO I/O subsystem:

More information