Technology evaluation at CSCS including BeeGFS parallel filesystem. Hussein N. Harake CSCS-ETHZ


Agenda
- CSCS
- About the Systems Integration (SI) Unit
- Technology Overview: DDN IME, DDN WOS, OpenStack
- BeeGFS Case Study: What is BeeGFS? Test System Layout, Tuning, Monitoring, Benchmark tools, Results, Next Steps
- Monitoring and Profiling
- Q&A

CSCS (Swiss National Supercomputing Centre)
- Founded in 1991
- Enables world-class research with a scientific user lab
- Available to domestic and international researchers through a transparent, peer-reviewed allocation process
- Open to academia and also available to users from industry and the business sector
- Operated by ETH Zurich and located in Lugano

24 years of supercomputers at CSCS
- 1991: NEC SX3, 5.5 GF (Adula)
- 1996: NEC SX4, 10 GF (Gottardo)
- 1999: NEC SX5, 64 GF (Prometeo)
- 2002: IBM SP4, 1.3 TF (Venus)
- 2005: Cray XT3, 5.8 TF (Palu)
- 2006: IBM P5, 4.5 TF (Blanc)
- 2009-12: Cray XE6, 402 TF (Monte Rosa)
- 2012-13: Cray XC30, 7.7 PF (Piz Daint)
- 2014: Cray XC30, 1.25 PF (Piz Daint extension)

Data Centre
- 2000 sq.m machine room
- 20 MW of power and cooling capacity
- Lake water cooling, 700 liters/s

Overview of the Systems Integration (SI) Unit
Unit missions:
- Managing projects
- Relations with vendors
- Evaluating technologies
- Software deployments

Greina Cluster

Technology Overview - DDN IME (image courtesy of DDN)

Technology Overview - DDN WOS (1) (image courtesy of DDN)

Technology Overview - DDN WOS (2)

Technology Overview - DDN WOS (3)

Technology Overview - OpenStack (image source: https://www.openstack.org/software/)

Eidos Layout

BeeGFS Case Study

What is BeeGFS?
- Parallel filesystem, HPC oriented
- Formerly called FhGFS
- Alternative to Lustre and GPFS
- Developed by Fraunhofer, open-source
- Support delivered by ThinkParQ
(Image courtesy of BeeGFS)

Basic Features of BeeGFS
- Supports failover for data and metadata using tools such as Pacemaker and Heartbeat
- Replication as a failover mechanism
- Supports multiple data and metadata servers and targets
- Supports quotas
- Robinhood can be used to scan the entire filesystem
- BeeGFS On Demand (BeeOND) filesystems
- Easy to deploy and manage
- Supports x86 and OpenPOWER platforms
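As a hedged illustration of day-to-day management (not taken from the slides), these are typical beegfs-ctl queries for inspecting servers, targets, and quota usage; the user name is a placeholder:

  # List registered metadata and storage servers known to the management daemon
  beegfs-ctl --listnodes --nodetype=meta --nicdetails
  beegfs-ctl --listnodes --nodetype=storage --nicdetails

  # Show storage targets with their reachability/consistency state and free space
  beegfs-ctl --listtargets --nodetype=storage --state --spaceinfo

  # Query quota usage for a given user (placeholder user name)
  beegfs-ctl --getquota --uid someuser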

Easy to deploy
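To give a flavour of why deployment is considered easy, here is a minimal, hedged sketch of a single-management setup using the standard beegfs-setup-* helper scripts; the hostname mgmt01, paths, and IDs are illustrative assumptions, not the configuration used at CSCS, and each command runs on its respective host:

  # Management service (assumed host: mgmt01)
  /opt/beegfs/sbin/beegfs-setup-mgmtd -p /data/beegfs/mgmtd

  # Metadata service (service ID 1), pointing at the management host
  /opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta -s 1 -m mgmt01

  # Storage service (service ID 1, target ID 101) on each storage server
  /opt/beegfs/sbin/beegfs-setup-storage -p /data/beegfs/storage -s 1 -i 101 -m mgmt01

  # Client: register the management host (mount point comes from /etc/beegfs/beegfs-mounts.conf)
  /opt/beegfs/sbin/beegfs-setup-client -m mgmt01

  # Start the services on their respective hosts
  systemctl start beegfs-mgmtd        # on mgmt01
  systemctl start beegfs-meta         # on the metadata server
  systemctl start beegfs-storage      # on the storage servers
  systemctl start beegfs-helperd beegfs-client   # on the clients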

BeeOND
- Creates a filesystem on demand
- Uses the hard drives / SSDs of every compute node
- The filesystem gets created by submitting a job to the scheduler; we are working on confirming SLURM support (see the sketch below)
- Memory could be used instead of SSDs
- We used 20 SSDs on 20 nodes for our tests
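As a hedged sketch of what creating BeeOND through a SLURM job could look like (SLURM integration was still being confirmed at the time; the local SSD path, mountpoint, and node count are assumptions), the beeond start/stop wrappers shipped with BeeGFS can be driven from a job script:

  #!/bin/bash
  #SBATCH --nodes=20
  #SBATCH --time=01:00:00

  # Build a nodefile from the allocation (one hostname per line)
  NODEFILE=$(mktemp)
  scontrol show hostnames "$SLURM_JOB_NODELIST" > "$NODEFILE"

  # Start a BeeOND instance: local SSD path on each node, common client mountpoint
  beeond start -n "$NODEFILE" -d /local/ssd/beeond -c /mnt/beeond

  # ... run the application against /mnt/beeond ...

  # Tear the instance down; -L -d follow the BeeOND stop example to clean up the local data
  beeond stop -n "$NODEFILE" -L -d
  rm -f "$NODEFILE"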

Benefits of BeeOND
- Benefits from unused space
- No impact on the parallel filesystem
- Real utilization of the high-speed network
- Filesystem scales with the compute nodes
- Open point: what is the overhead on the compute nodes?

Test System Layout
- DDN 7700: one couplet (two controllers), one enclosure with 60 drives, 4 * FDR links
- 6 SSDs in one RAID volume, 6 * 9-drive RAID 5 volumes
- Two x86 servers: dual-socket SB, 128 GB memory, 2 * FDR links
- Fabric: 1 * FDR link

Tuning the servers

# VM / memory management tuning
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 20 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes

# Transparent huge pages
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/defrag

# Block device tuning for the RAID volumes (dm-0 .. dm-6)
for dev in dm-0 dm-1 dm-2 dm-3 dm-4 dm-5 dm-6
do
    echo deadline > /sys/block/$dev/queue/scheduler
    echo 4096 > /sys/block/$dev/queue/nr_requests
    echo 32768 > /sys/block/$dev/queue/read_ahead_kb
    echo 32767 > /sys/block/$dev/queue/max_sectors_kb
done

# CPU frequency governor and NUMA zone reclaim
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 1 > /proc/sys/vm/zone_reclaim_mode

Documentation for the tuned parameters:
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
https://access.redhat.com/solutions/46111
http://www.slideshare.net/rampalliraj/linux-kernel-io-schedulers?from_action=save
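As a side note not on the slide: the /proc/sys/vm writes above only last until reboot. One way to persist them, assuming a distribution that reads /etc/sysctl.d/, is a sysctl drop-in file with the same values:

  # Illustrative /etc/sysctl.d/90-beegfs-tuning.conf with the values from the slide
  cat > /etc/sysctl.d/90-beegfs-tuning.conf <<'EOF'
  vm.dirty_background_ratio = 5
  vm.dirty_ratio = 20
  vm.vfs_cache_pressure = 50
  vm.min_free_kbytes = 262144
  vm.zone_reclaim_mode = 1
  EOF

  # Apply all sysctl configuration files immediately
  sysctl --system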

Monitoring client activities (1)

Monitoring server activities (2)

Benchmark tools
- mdtest: measures metadata performance, https://sourceforge.net/projects/mdtest/
- IOzone: measures read and write throughput, http://www.iozone.org
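For orientation, hedged example invocations of the two tools; the process counts, file sizes, and paths are assumptions, not the exact parameters used in these tests:

  # mdtest: 64 MPI ranks, 1000 items per rank, 3 iterations, working directory on BeeGFS
  mpirun -np 64 mdtest -n 1000 -i 3 -d /beegfs/mdtest

  # IOzone distributed throughput mode: 64 clients listed in a host file,
  # 16 GB file per client, 1 MB records, write (-i 0) and read (-i 1) tests
  iozone -+m clients.txt -t 64 -s 16g -r 1m -i 0 -i 1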

IOzone results on /beegfs

Children see throughput for 64 initial writers = 5032700.90 kb/sec
  Min throughput per process = 63754.09 kb/sec
  Max throughput per process = 103798.58 kb/sec
  Avg throughput per process = 78635.95 kb/sec
  Min xfer = 12880896.00 kb

Children see throughput for 64 rewriters = 4996297.63 kb/sec
  Min throughput per process = 68781.82 kb/sec
  Max throughput per process = 90666.23 kb/sec
  Avg throughput per process = 78067.15 kb/sec
  Min xfer = 16473088.00 kb

Children see throughput for 64 readers = 4225632.91 kb/sec
  Min throughput per process = 40047.24 kb/sec
  Max throughput per process = 77678.61 kb/sec
  Avg throughput per process = 66025.51 kb/sec
  Min xfer = 10813440.00 kb

Children see throughput for 64 re-readers = 4253662.00 kb/sec
  Min throughput per process = 56998.73 kb/sec
  Max throughput per process = 76042.87 kb/sec
  Avg throughput per process = 66463.47 kb/sec
  Min xfer = 15729664.00 kb
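To put the aggregate numbers in context (a conversion added here, not on the slide, assuming IOzone's kB means 1024 bytes): the 5,032,700 kb/sec seen by the 64 initial writers is roughly 5032700 / 1024 / 1024, i.e. about 4.8 GiB/s of aggregate write throughput, and the 4,225,632 kb/sec seen by the 64 readers is about 4.0 GiB/s.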

Mdtest results on BeeOND (charts): directory creation, directory stat, and directory removal rates in directories per second, plotted against the number of MDSs (1, 2, 4, 8, 16, 20).

Mdtest results on BeeOND (charts): file creation, file stat, and file removal rates in files per second, plotted against the number of MDSs.

Next steps
- Scaling on a bigger cluster
- Verifying the failover procedures
- Verifying the BeeOND overhead on compute nodes
- Using NVMe instead of SSDs
- Using tmpfs
- Creating BeeOND through SLURM jobs
- Using Robinhood to scan millions of files

Check-MK Monitoring and Profiling

CPU Utilization

Q&A hussein@cscs.ch