Triton file systems - an introduction slide 1 of 28

File systems
- Motivation & basic concepts
- Storage locations
- Basic flow of IO
- Do's and Don'ts
- Exercises
slide 2 of 28

File systems: Motivation
- Case #1: "I'll just run my code from here." 15 min later other users cannot access their files!
- Case #2: "My directory listing takes 10 min while on my laptop it takes 3 s. Am I doing something wrong?"
- Case #3: "My files are organized in such a fashion that I can't show the important ones to my supervisor when he/she comes to visit."
With IO/storage you can (easily) go horribly wrong.
slide 3 of 28

File systems: Basic concepts
What is a file? File = metadata + block data
- Accessing block data: cat a.txt
- Showing some metadata: ls -l a.txt
slide 4 of 28
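As an illustration of the metadata/data split (a.txt is just the placeholder file name from the slide), stat prints only the metadata the file system keeps about the file, while cat reads its data blocks:

    stat a.txt    # metadata: size, owner, permissions, timestamps, inode
    cat a.txt     # block data: the file contents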

File systems: Basic concepts
What is a file system? A data store that manages multiple files.
- ext4 is used on Triton/Linux
- Can be local (physical disks on a server) or shared over the network
- Some basic operations: ls, cp, scp, rsync, rm, vi, less, ...
slide 5 of 28

File systems: Basic concepts
Access and limits
- Quota enforced, mainly for groups
- Linux filesystem permissions
- Some departments export the filesystem to department workstations (BECS, ICS)
- rsync, scp and SSHFS are always options for transferring files to/from the cluster
slide 6 of 28
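For example (the login host name and paths are illustrative; substitute your own account and work directory):

    # copy a local file to your work directory on the cluster
    rsync -av mydata.tar.gz username@triton.aalto.fi:/triton/<department>/work/<username>/
    # or with scp
    scp mydata.tar.gz username@triton.aalto.fi:/triton/<department>/work/<username>/
    # or mount the remote directory locally over SSH
    mkdir -p ~/triton-work
    sshfs username@triton.aalto.fi:/triton/<department>/work/<username> ~/triton-work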

How storage is organized in Triton slide 7 of 28

File systems: NFS
- Network filesystem
- The server/storage is a single point of failure (SPOF) and a bottleneck
slide 8 of 28

File systems: Lustre
- Many servers & storage elements
- Redundancy and scaling
- A single file can be split across multiple servers!
slide 9 of 28

File systems: Network storage
User's home folder (NFS filesystem)
- /home/<username> directory
- Common to every compute node
- Meant for scripts etc.
- Nightly backup, 1 GB quota
Work/Scratch (Lustre filesystem)
- /triton/<department>/work/<username>
- Common to every compute node
- Meant for input/output data
- Quota varies per project
- No backups (reliable system with RAID)
- Has an optimized find function, 'lfs find'. In the autumn, when the system is updated, even more dedicated tools will be available.
slide 10 of 28
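On the Lustre work directory, 'lfs find' is the recommended search; a minimal comparison (the path and pattern are placeholders):

    # Lustre-aware search, as suggested on the slide
    lfs find /triton/<department>/work/<username>/data -name '*.csv'
    # plain find works too, but is typically heavier on a Lustre filesystem
    find /triton/<department>/work/<username>/data -name '*.csv'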

File systems: Local storage
Storage local to the compute node
- /local directory
- Best for storage during the calculation
- Copy relevant data elsewhere after the computation; it will be deleted on re-install
- $TMPDIR variable: a directory automatically deleted after the job. Clean up yourself if using something else.
Ramdisk for fast IO operations
- Special location, similar to /local
- /ramdisk directory (20 GB)
- Use case: a job spends most of its time doing file operations on millions of small files.
slide 11 of 28
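A quick way to check what local space a node offers (run inside a job; $TMPDIR is only set by the batch system, and sizes vary per node as the summary slide notes):

    echo "$TMPDIR"           # per-job scratch directory, removed automatically after the job
    df -h /local /ramdisk    # free space on the local disk and the 20 GB ramdisk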

File systems: Summary

  Location   Type      Usage           Size/quota
  /home      NFS       Home dir        1 GB
  /local     local     Local scratch   ~800 GB (varies)
  /triton    Lustre    Work/Scratch    200 GB (varies, even several TB)
  /ramdisk   Ramdisk   Local scratch   20 GB

slide 12 of 28

File systems: Meters of performance
Stream I/O and random IOPS
- Stream measures the speed of reading large sequential data from the system
- IOPS measures small random reads to the system, i.e. the number of metadata/block data accesses
To measure your own application, profiling or internal timers are needed. A rough estimate can be acquired from /proc/<pid>/io or by using strace.
slide 13 of 28
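A sketch of both rough estimates mentioned above (replace the PID and program name with your own):

    cat /proc/12345/io       # cumulative bytes read/written and read/write syscall counts of a running process
    strace -c ./my_program   # per-syscall counts for a fresh run (adds overhead, use on a small test case)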

File systems: Meters of performance
Total numbers:

  Device             IOPS          Stream
  SATA disk (7.2k)   50-100        50 MB/s
  SSD disk           3000-10 000   500 MB/s
  Ramdisk            40 000        5000 MB/s
  Triton NFS         300           300 MB/s
  Triton Lustre      20 000        4000 MB/s

With 200 concurrent jobs using storage...

  Device             IOPS          Stream
  SATA disk (7.2k)   50-100        50 MB/s
  SSD disk           N/A           N/A
  Triton NFS         1.5           1.5 MB/s
  Triton Lustre      100           20 MB/s

DON'T run jobs from HOME! (NFS)
slide 14 of 28

File systems: Advanced Lustre
By default striping is turned off.
- lfs getstripe <dir>: shows striping
- lfs setstripe -c N <dir>: stripe over N targets, -1 means all targets
- lfs setstripe -d <dir>: revert to the default
Use with care. Useful for HUGE files (>100 GB) or parallel I/O from multiple compute nodes (MPI-I/O).
Real numbers from a single client (16 MB IO blocksize for 17 GB):

  Striping   File size   Stream MB/s
  Off (1)    17 GB       214
  2          17 GB       393
  4          17 GB       557
  max        17 GB       508
  max        11 MB       55
  max        200 KB      10

slide 15 of 28
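A concrete example of the commands above on a hypothetical directory of large simulation outputs:

    lfs getstripe /triton/<department>/work/<username>/bigruns        # check the current layout
    lfs setstripe -c 4 /triton/<department>/work/<username>/bigruns   # stripe new files over 4 targets
    lfs setstripe -d /triton/<department>/work/<username>/bigruns     # revert to the filesystem default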

File systems: Know your program
Questions that you should ask yourself:
1. How long does it take to analyze a single dataset?
2. How big is a single input dataset (byte size / number of files)?
3. Does your code read/write data that is not useful?
4. How often does the code access the files? Once or constantly?
5. Can my usage cause problems for others?
Why are these important?
1. Calculation is faster than disk access, so calculation time should be large compared to the data transfer time.
2. If a dataset consists of small fragments, grouping them can lower metadata access. If a dataset consists of large files, bad read/write buffering can cause loads of IO operations.
3. This should be self-evident. IO and storage are resources just like CPU clock cycles or RAM usage.
4. Constant access slows down the file system; once is better. If constant access is required, using the /local drive will divert the load from the network drives.
5. With great power comes great responsibility.
slide 16 of 28

Do's and Don'ts slide 17 of 28

File systems: Workflow suggestion
Storage:
- Program folder in /work, with /src (symlink) and /data
- Program source in /home
- Data in /work
Runtime (work in the node):
- Copy input to disk (input in /work)
- Work with temp data (temp data in /local, program in /work)
- Copy output to the network file system (output in /work)
slide 18 of 28
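A minimal sketch of that runtime flow as a batch script (paths and the program name are placeholders; $TMPDIR points to the node-local disk as described earlier):

    #!/bin/bash
    #SBATCH --time=01:00:00
    WORK=/triton/<department>/work/<username>/myproject

    cp "$WORK/data/input.dat" "$TMPDIR/"             # 1. copy input to the local disk
    cd "$TMPDIR"
    "$WORK/src/my_program" input.dat > output.dat    # 2. work with temp data locally
    cp output.dat "$WORK/"                           # 3. copy output back to the network file system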

File systems: Do's and Don'ts
Use version control and keep your data tidy
- Pick one version control system: svn, hg, git. Only for source code.
- If you find stuff in your folders you didn't know existed, think about cleaning up.
- With consistent naming or a data format like HDF you can see your data sets at a glance (no need to use find etc.)
- Don't use cp as a backup to guard your code/data from your mistakes.
slide 19 of 28
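For the version-control suggestion, a minimal git sketch (the project path and file names are hypothetical):

    cd /triton/<department>/work/<username>/myproject/src
    git init
    git add my_program.c run.sh
    git commit -m "Initial version of analysis code"
    # commit before each significant change instead of keeping ad-hoc copies like my_program.c.bak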

File systems: Real use case
Storage:
- Laptop: 1 dataset to work with
- Department machine: multiple datasets
- Triton /work: important initial data, multiple datasets, synced with scp
Runtime:
- Initial data in /work; use an R script to choose part of the data
- Input data in /work; copy the input to the local disk
- Temp data in /local; work in the node
- Output dataset in /work
slide 20 of 28

File systems: Real use case
Storage:
- Laptop: 1 dataset to work with; syncing is easier with rsync
- Department machine
- Triton /work: important initial data (small), multiple datasets, accessed through a department-provided mountpoint
- Triton /home: backup of the important initial data (small)
Runtime:
- Initial data in /work; use Matlab to choose the initial data
- Work in the node; temp data in /local
- Output dataset in /work
slide 21 of 28

File systems: Do's and Don'ts
Lots of small files (10k+, <1 MB each)
- A bad starting point in general, though sometimes there is no way to improve it (e.g. legacy code)
- /ramdisk or /local: the best place for these
- Lustre: not the best place. With many users, the local disk provides more IOPS and Stream in general.
- NFS (Home): very bad idea, do not run calculations from Home
- The very best approach: modify your code. Large file(s) instead of many small ones (e.g. HDF5), or even no files at all. Sometimes the IO is due to unnecessary checkpointing.
slide 22 of 28
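If the code cannot be changed, a common workaround (sketched here with placeholder names) is to keep the small files bundled as one large file on Lustre and unpack them onto local storage only for the run:

    # once, in the work directory: bundle the small files into a single archive
    tar cf smallfiles.tar smallfiles/
    # inside the job: unpack onto the per-job local scratch and work there
    tar xf smallfiles.tar -C "$TMPDIR"
    ./my_program "$TMPDIR/smallfiles"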

File systems: Do's and Don'ts
Databases (sqlite)
- These can generate a lot of small random reads (= IOPS)
- /ramdisk or /local: the best place for these
- Lustre: not the best place. With many users, the local disk provides more IOPS and Stream in general.
- NFS (Home): very bad idea
slide 23 of 28

File systems: Do's and Don'ts
ls vs ls -la, in a directory with 1000 files:
- A simple ls is only a few IOPS
- ls -la on a local fs: 1000+ IOPS (stat() on each file!)
- ls -la on NFS: a bit more overhead
- ls -la on Lustre (striping off): 2000 IOPS (+ RPCs)
- ls -la on Lustre (striping on): 31 000 IOPS! (+ RPCs) => the whole Lustre is stuck for a while for everyone
Use ls -la and its variants (ls --color) ONLY when needed.
slide 24 of 28
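You can see the difference yourself with the syscall counter (any directory with many files will do; 'somedir' is a placeholder):

    strace -c ls somedir 2>&1 | tail        # only a handful of calls
    strace -c ls -la somedir 2>&1 | tail    # roughly one stat()/lstat() per file on top of that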

Exercise: File systems
15 minutes to proceed; use the wiki/Google to solve.
Simple file system operations:
- Use git to clone the code repository in /triton/scip/lustre to your work directory.
- Use 'strace -c' on 'ls <file>' and 'ls -l <file>'. Compare the output with e.g. grep/diff.
- Run create_iodata.sh to create sample data. Compare 'strace -c' of 'lfs find <directory>' and 'find <directory>' searches of the 'data' directory.
- Submit iotest.sh with sbatch. What does the output mean?
- Try to convert the code to use $TMPDIR. Once you're sure it works, change 'ls' to 'ls -l'. Compare the results.
- Convert the code to use tar/zip/gzip/bzip2. Can you profile the load caused on the network drive?
slide 25 of 28

File systems: Best practices
When unsure what the best approach is:
- Check the Do's and Don'ts above
- Google?
- Ask your local Triton support person: tracker.triton.aalto.fi
- Ask your supervisor
- Trial and error (profile it)
slide 26 of 28

File systems: Further topics
These are not covered here; ask/Google if you want to learn more.
- Using Lustre striping (briefly mentioned)
- HDF5 for small files
- Benchmarking: what is the share of IO in a job
- MPI-IO
- Hadoop
slide 27 of 28

Questions or comments regarding Triton file systems? slide 28 of 28