TraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks

TraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks
Bilge Acun, PhD Candidate, Department of Computer Science, University of Illinois at Urbana-Champaign
Contributors: Nikhil Jain, Abhinav Bhatele, Misbah Mubarak, Christopher D. Carothers, and Laxmikant V. Kale

Network Simulation
Motivation: design of future supercomputers
- Node architecture
- Interconnection network
- Predict application performance on existing and non-existing architectures
State of the art: discrete-event based simulation
- Not parallel or scalable
- Large memory footprints
- Cannot simulate real HPC workloads; relies on synthetic communication patterns and skeletonized codes

TraceR: Trace Replay
- A trace-driven simulator: optimistic parallel discrete-event simulation (PDES) for real HPC traffic workloads
- Outperforms state-of-the-art simulators: BigNetSim, SST
- Scalable: simulates execution on half a million nodes in under 10 minutes using 512 cores
- Optimistic simulation parameter study to maximize performance when simulating real HPC traffic workloads

TraceR Components
- Input: application traces from BigSim (AMPI & Charm++ applications), a network configuration (dimensions, bandwidth, packet size, ...), and PDES parameters
- TraceR drives CODES, which provides accurate packet-level network models (Torus, Dragonfly, ...), on top of ROSS, the underlying PDES framework
- Output: performance prediction

BigSim Simulator
- One of the earliest packet-level HPC network simulators (around 2004)
- Emulation framework: can generate traces using far fewer cores than the actual target machine
- Built on the POSE PDES framework, which is the cause of its slow performance and poor scaling

BigSim Trace Format
One entry per Sequential Execution Block (SEB), with fields: time stamp, task ID, name, duration, ..., message ID, source node, ..., backward & forward dependencies ($B / $F). Example entries:

    -1.000000 47 AMPI_Bcast--time:5960    0.000006 ... $B 46    $F 53
     0.001148 48 start-broadcast--time:0  0.000000 ... $B       $F 49
    -1.000000 49 AMPI_generic--time:3099  0.000003 ... $B 48    $F 50 52
    -1.000000 50 end-broadcast--time:0    0.000000 ... $B 49    $F
     0.001151 51 msgep--time:953          0.000001 ... $B       $F
     0.001154 52 RECV_RESUME--time:953    0.000001 ... $B 49    $F 53
    -1.000000 60 user_code--time:0        0.000000 ... $B 59 54 $F 61
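
To make the layout concrete, here is a minimal C++ sketch that parses the simplified, whitespace-separated form of one such entry into its leading fields and $B/$F dependency lists. It is illustrative only: the struct and function names are hypothetical, and BigSim's actual trace reader handles more fields and a binary encoding.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hypothetical, simplified view of one Sequential Execution Block (SEB).
    struct SebEntry {
        double timeStamp = 0.0;        // -1.0 means "resolved during replay"
        int taskId = -1;
        std::string name;              // e.g. "AMPI_Bcast--time:5960"
        double duration = 0.0;
        std::vector<int> backwardDeps; // tasks that must complete first
        std::vector<int> forwardDeps;  // tasks unlocked by this one
    };

    // Parse the textual form shown above; tokens between the duration and
    // "$B" (message ID, source node, ...) are skipped for brevity.
    SebEntry parseSeb(const std::string& line) {
        SebEntry e;
        std::istringstream in(line);
        in >> e.timeStamp >> e.taskId >> e.name >> e.duration;
        std::vector<int>* deps = nullptr;
        for (std::string tok; in >> tok; ) {
            if (tok == "$B")      deps = &e.backwardDeps;
            else if (tok == "$F") deps = &e.forwardDeps;
            else if (deps && tok != "...") deps->push_back(std::stoi(tok));
        }
        return e;
    }

    int main() {
        SebEntry e = parseSeb("-1.000000 49 AMPI_generic--time:3099 0.000003 ... $B 48 $F 50 52");
        std::cout << "task " << e.taskId << ": " << e.backwardDeps.size()
                  << " backward, " << e.forwardDeps.size() << " forward deps\n";
    }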

Definitions and Evaluation Metrics
Definitions:
- PE: a simulated process; the logical process (LP) visible to ROSS
- Task: a sequential execution block (SEB)
- Event: an action with a time stamp in the PDES (Kickoff Event, Message Recv Event, Completion Event)
- Reverse handler: responsible for reversing the effect of an event
Metrics:
- Execution time: time spent performing the simulation
- Event rate: number of events executed per second (excluding rollbacks)
- Event efficiency (or rollback efficiency): see the formula below
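
The efficiency formula itself did not survive the transcription. Assuming the conventional definition used by ROSS-based optimistic simulators (an assumption, not a quote from the slide), it reads:

\[
\text{event efficiency (\%)} = \left(1 - \frac{\#\,\text{rolled-back events}}{\#\,\text{committed events}}\right) \times 100
\]

Under this definition, a negative efficiency indicates that more events were rolled back than usefully committed.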

TraceR: Execution Flow
Mapping from TraceR functions to ROSS events:
- First task → Execute Task
- Send message to other PEs → Remote Message event
- Schedule completion event → Completion Event
- Receive message from other PEs → Message Recv Event
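
Because ROSS executes these events optimistically, every event needs the reverse handler mentioned above. The following self-contained C++ sketch illustrates the forward/reverse pairing on a toy per-PE state; it is a conceptual illustration, not TraceR's or ROSS's actual event-handler API.

    #include <cassert>
    #include <iostream>

    // Toy per-PE state: how many tasks this logical process has executed.
    struct PeState {
        int tasksExecuted = 0;
    };

    // A toy completion event; real events would carry message/task details.
    struct CompletionEvent {
        int taskId;
    };

    // Forward handler: apply the event's effect. In a real optimistic
    // simulator this is also where follow-up events (e.g. remote messages
    // to other PEs) would be scheduled.
    void completionForward(PeState& s, const CompletionEvent&) {
        s.tasksExecuted += 1;
    }

    // Reverse handler: undo exactly what the forward handler did, so the
    // engine can roll back when a straggler (out-of-order) message arrives.
    void completionReverse(PeState& s, const CompletionEvent&) {
        s.tasksExecuted -= 1;
    }

    int main() {
        PeState pe;
        CompletionEvent ev{42};
        completionForward(pe, ev);  // optimistic execution
        completionReverse(pe, ev);  // rollback on causality violation
        assert(pe.tasksExecuted == 0);
        std::cout << "state restored after rollback\n";
    }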

Experimental Results
- Scaling results were run on Blue Waters at UIUC
- Prediction study results use Vulcan at LLNL
Applications:
- 3D Stencil: AMPI application; 7-point Jacobi relaxation on a 3D grid with 128 x 128 x 128 grid points per MPI process (each exchanged face is a 128 x 128 plane of doubles, i.e. 128 KB messages)
- LeanMD: Charm++ application; mini-app version of the NAMD molecular dynamics simulation, mimicking NAMD's short-range force calculations; 1.2 million atoms
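
For reference, a minimal sketch of the 7-point Jacobi update performed by such a 3D stencil benchmark (a generic formulation under the stated grid size, not the actual AMPI benchmark code):

    #include <vector>

    // One 7-point Jacobi sweep over the interior of a local block: each
    // point becomes the average of itself and its six face neighbors.
    // Halo exchange with neighboring MPI processes is omitted here.
    void jacobiSweep(const std::vector<double>& in, std::vector<double>& out, int n) {
        auto idx = [n](int i, int j, int k) { return (i * n + j) * n + k; };
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j)
                for (int k = 1; k < n - 1; ++k)
                    out[idx(i, j, k)] =
                        (in[idx(i, j, k)] +
                         in[idx(i - 1, j, k)] + in[idx(i + 1, j, k)] +
                         in[idx(i, j - 1, k)] + in[idx(i, j + 1, k)] +
                         in[idx(i, j, k - 1)] + in[idx(i, j, k + 1)]) / 7.0;
    }

    int main() {
        const int n = 128;                        // 128 x 128 x 128 points per process
        std::vector<double> a(n * n * n, 1.0), b(n * n * n, 0.0);
        jacobiSweep(a, b, n);                     // one relaxation sweep
        // Between sweeps, each 128 x 128 face of doubles (128 KB) would be
        // exchanged with the corresponding neighbor process.
    }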

Sequential Comparison of Simulators (skeletonized MPI code) [figure]

Conservative vs. Optimistic [figure]

TraceR Scaling with the AMPI Application [figure]

TraceR Scaling with the AMPI Application (continued) [figure]

Event Efficiency [figure]

Trace Reading Time: insignificant overhead as the number of cores increases [figure]

TraceR Performance Prediction with the Charm++ Application [figure]

Event Rate (million events/s) [figure]

Efficiency [figure]

Ongoing Work and Summary
Ongoing & future work:
- Fat-tree network model, integrated into CODES
- Multiple-job simulations: effect of multiple jobs sharing the network, a more realistic scenario
- Switch from MPI-based ROSS to Charm++-based ROSS
TraceR feature highlights:
- A parallel, trace-driven, scalable network simulator
- Support for various topologies: Torus, Dragonfly, Fat-tree
- Simulates AMPI and Charm++ applications
- Can simulate half a million nodes in minutes

Thank you!
Paper in progress: Bilge Acun, Nikhil Jain, Abhinav Bhatele, Misbah Mubarak, Christopher D. Carothers, and Laxmikant V. Kale. TraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks.
TraceR source code: http://charm.cs.uiuc.edu/gerrit/#/admin/projects/tracer

TraceR Scaling with the Charm++ Application [figure]