Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009

Outline: 1. Motivation; 2. Theory of a non-blocking benchmark; 3. The benchmark and results; 4. Future work

Motivation: Performance improvements
In the computing environment: hardware setup, MPI distribution, parallel file system, different tuning options inside the software layers. Inside the application: workload distribution, organization of individual program steps, non-blocking MPI operations.

Motivation: Performance analysis
In order to improve performance, one needs to understand a complex system, deal with a lot of unknowns, and evaluate the possible margin of improvement. Benchmarks are needed for this.

Theory of a non-blocking benchmark

Theory of a non-blocking benchmark: Possible scenarios
Consider a program which produces data iteratively, does not perform write operations on the data for a while, and has access to non-blocking functions. (Figure: timeline of the blocking client program.) Computation or I/O can then be hidden behind the other part, respectively: sequential operations become concurrent operations.

Theory of a non-blocking benchmark: Scenarios
The most promising scenario is the one in which computation and I/O can be overlapped perfectly. (Figure: timelines of the non-blocking and the blocking client program.) Any other iterative scenario can only save less time.

Theory of a non-blocking benchmark: Evaluation value
Non-blocking efficiency ratio:

    0.5 <= (duration of non-blocking version) / (duration of blocking version) <= 1.0    (1)

The actual value quantifies the overhead necessary to execute computation and I/O concurrently.
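The bounds in (1) follow from a simple timing model (an assumption for illustration, not stated on the slide): the blocking version needs t_comp + t_io, while a perfectly overlapped non-blocking version needs max(t_comp, t_io) plus any concurrency overhead. A minimal sketch:

```python
def efficiency_ratio(t_comp, t_io, overhead=0.0):
    """Non-blocking efficiency ratio under an idealized overlap model.

    blocking duration:     t_comp + t_io  (phases run one after the other)
    non-blocking duration: max(t_comp, t_io) + overhead of running them concurrently
    """
    return (max(t_comp, t_io) + overhead) / (t_comp + t_io)

# Perfectly balanced phases reach the theoretical optimum of 0.5;
# if one phase dominates, there is nothing left to hide and the ratio tends to 1.0.
print(efficiency_ratio(10.0, 10.0))  # -> 0.5
print(efficiency_ratio(10.0, 0.0))   # -> 1.0
```

With overhead greater than zero the ratio rises above 0.5, which is exactly what the measured values later in the talk quantify.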

Theory of a non-blocking benchmark: Desired scenario
We want a scenario which in theory promises the best results and requires the file system to do as much physical I/O as possible. (Figure: timeline of the non-blocking client program and the I/O server; the server performs physical I/O on data of previous write operations while receiving new data from the client program.)

Theory of a non-blocking benchmark: Expected blocking version
In the blocking version, the I/O server gains time to do write-behind and free its cache, so the computation and I/O parts are individually shorter. (Figure: timeline of the client program and the I/O server; between the client's write phases the server performs physical I/O on data of previous write operations and frees its cache.)

Theory of a non-blocking benchmark: Benchmark goal
Overall, the interest is in providing the best possible insight into the actually achievable advantages of non-blocking operations, and in showing how close the software used can get to the optimal efficiency ratio in an optimal scenario.

The benchmark and results

The benchmark and results: Benchmark details
Benchmark sequence: 1. Compute the necessary workload so that in the non-blocking version computation and I/O take the same amount of time. 2. Run the non-blocking test. 3. Run the blocking test. Before each phase, the file system buffer is filled.
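Since the MPICH2 version used here emulates the non-blocking functions with threads (see the software slide), the core of the benchmark loop can be sketched in a few lines; `compute_block` and `sink` are illustrative stand-ins, not names from the benchmark:

```python
import io
import threading

def compute_block(i):
    # Stand-in for the computation phase: produce the next data block.
    return bytes([i % 256]) * 1024

def run_blocking(n_blocks, sink):
    # Blocking variant: each write must finish before the next computation starts.
    for i in range(n_blocks):
        sink.write(compute_block(i))

def run_nonblocking(n_blocks, sink):
    # Thread-emulated non-blocking variant: write of block i runs
    # while block i+1 is being computed.
    writer = None
    for i in range(n_blocks):
        data = compute_block(i)      # overlaps with the previous write
        if writer is not None:
            writer.join()            # like MPI_Wait: previous write must complete
        writer = threading.Thread(target=sink.write, args=(data,))
        writer.start()               # like an immediate (non-blocking) write
    if writer is not None:
        writer.join()
```

Both variants must write identical data; the non-blocking one simply lets one write proceed in the background while the next block is computed.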

The benchmark and results: Computing environment
Hardware: 9 nodes, each with two Intel Xeon 2 GHz CPUs, 1 GB DDR-RAM and 1 GBit/s Ethernet interfaces; 5 of the nodes host the PVFS2 servers, each with two 160 GB S-ATA HDDs set up as a RAID-0.

The benchmark and results: Computing environment
Software: MPICH2 (version mpich2-1.0.5p4; non-blocking functions emulated using threads) and PVFS2 (version 2.6.2). For visualization of the benchmark behavior we used PIOviz, which includes a modified MPE trace and an extended Jumpshot.

The benchmark and results: Test combinations
The executed tests cover the combinations of: number of CPUs per benchmark process: 1, 2; number of benchmark processes: 1, 2, 4; number of I/O servers: 1, 2, 4.
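These three parameters span a small test matrix; enumerating it (a trivial illustration, not part of the benchmark code) shows the 2 x 3 x 3 = 18 configurations:

```python
from itertools import product

cpus_per_process = (1, 2)
benchmark_processes = (1, 2, 4)
io_servers = (1, 2, 4)

# Every executed test is one (CPUs, processes, servers) triple.
test_matrix = list(product(cpus_per_process, benchmark_processes, io_servers))
print(len(test_matrix))  # -> 18
```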

The benchmark and results: Test combinations
Tests with 2 CPUs per benchmark process were run so that possible overhead (e.g. from the emulation thread) can be ignored. These tests showed that the theoretical optimum can nearly be achieved: non-blocking efficiency ratio = 0.53 (2)

The benchmark and results: Non-blocking test
4 CPUs, 4 benchmark processes, 4 I/O servers. (Figure: trace visualization of the run.)

The benchmark and results: Blocking test
4 CPUs, 4 benchmark processes, 4 I/O servers. (Figure: trace visualization of the run.)

The benchmark and results: Scenario results
On average over a series of test runs we obtain: non-blocking efficiency ratio = 0.67 (3). In the non-blocking test, the two phases individually take longer, by factors of 1.25 and 1.29, than the corresponding phases in the blocking test.
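As a back-of-the-envelope consistency check (my arithmetic, not from the slides): with balanced phases in the blocking run and the two phases stretched by the reported factors of 1.25 and 1.29, a perfect overlap of the stretched phases would give a ratio of max(1.25, 1.29)/2, about 0.645, in line with the measured 0.67:

```python
# Blocking run: two balanced unit-length phases -> total duration of 2.0
t_a, t_b = 1.0, 1.0
# Non-blocking run: each phase stretched by its measured slowdown factor,
# then overlapped perfectly, so the longer stretched phase dominates.
nonblocking = max(1.25 * t_a, 1.29 * t_b)
ratio = nonblocking / (t_a + t_b)
print(round(ratio, 3))  # -> 0.645, close to the measured ratio of 0.67
```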

The benchmark and results: Overall results
For all interesting cases with 1 CPU per benchmark process we obtained: non-blocking efficiency ratio < 0.75 (4), even though the individual parts take longer in the non-blocking version.

Future work: Conclusion
While we only looked at write operations and at scenarios well suited for the use of non-blocking operations, non-blocking operations show the potential to reduce execution times of HPC applications. Performance improvements with non-blocking operations are possible with the computing environment used. The measurements suggest that short phases can be hidden entirely behind longer phases.

Future work: Future work
Also of interest are an analysis of read operations, large-scale tests, and scenarios where either computation or I/O takes up the bulk of the execution time.