xsim The Extreme-Scale Simulator
1 xsim The Extreme-Scale Simulator Janko Strassburg Severo Ochoa BSC, 28 Feb 2014
2 Motivation Future exascale systems are predicted to have hundreds of thousands of nodes, thousands of processors per node. Limited application scalability due to sequential parts, synchronizing communication and other bottlenecks Investigating performance of parallel applications at scale is an important component of HPC hardware/software co-design Behaviour on future architectures Performance impact of architecture choices
3 Overview Existing simulators include JCAS, BigSim, Dimemas, MuPi Limitations (run time, number of concurrent threads executed, ...) Highly scalable solutions trade off accuracy in exchange for scalability Nodes oversubscribed for simulation Highly accurate simulations are extremely slow and less scalable The Extreme-scale Simulator permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale HPC system using architectural models and virtual timing.
4 Overview Parallel discrete event simulation (PDES) to emulate the behaviour of various architecture models Execution of real applications, algorithms or their models atop a simulated HPC environment for: Performance evaluation, including identification of resource contention and underutilization issues Investigation at extreme scale, beyond the capabilities of existing simulation efforts S. Boehm and C. Engelmann. xsim: The Extreme-Scale Simulator. HPCS 2011, Istanbul, Turkey, July 4-8, 2011.
5 Overview Combining highly oversubscribed execution, a virtual MPI, and a time-accurate PDES (parallel discrete event simulation) PDES uses the native MPI and simulates virtual processors The virtual processors expose a virtual MPI to applications Multithreaded MPI implementation needed (e.g. Open MPI --enable-mpi-thread-multiple) 2010 IEEE Cluster Co-Design Workshop
6 Overview The simulator is a library Utilizes PMPI to intercept MPI calls and to hide the PDES Easy to use: replace the MPI header, compile and link with the simulator library, run the MPI program Support for C and Fortran MPI applications
7 Overview xsim is designed like a traditional performance tool Interposition library sitting between MPI application and MPI library Uses simulated wall clock time for measurement Performance data extracted based on processor and network model MPI performance tool interface (PMPI) Supports Simulated MPI point-to-point communication (essential calls) Simulated MPI data types, groups, communicators, collective communication (full) 81 simulated MPI calls for each C and Fortran ULFM MPI extensions
8 Comparison to Dimemas Online simulator, change in model requires rerun Batch simulations through scripts, configuration files, command line options Application model available Oversubscribes nodes Larger simulations than underlying system possible Support of multi threaded MPI implementations Fault Tolerance support through Open MPI ULFM Version with locks available, albeit significantly slower No two calls in MPI comm at the same time Non blocking calls -> blocking calls
9 Simulation Models Processor model Based on actual execution time of underlying hardware Scaled up for simulated processor speed Heterogeneous cores with differing speeds Support for various network architecture models Analyze existing hardware conditions / experiment with differing architectures Latency and bandwidth restrictions Hierarchical combinations Network on chip Network on node Sender/Receiver contention simulation Full contention not supported due to scalability reasons
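The processor and network models on this slide can be illustrated with a minimal Python sketch. It is not xsim code: the function names are hypothetical, and it assumes the simplest reading of the slide, namely that a relative speed of 1.0 reproduces native timing (so 0.5 models a core half as fast) and that a link is described only by a latency and a bandwidth, with no contention.

```python
def simulated_compute_time(native_seconds: float, relative_speed: float) -> float:
    """Scale a natively measured compute interval to the simulated processor.

    Assumption: relative_speed means "simulated speed / native speed",
    so 1.0 reproduces native timing and 0.5 models a core half as fast.
    """
    if relative_speed <= 0:
        raise ValueError("relative_speed must be positive")
    return native_seconds / relative_speed


def message_transfer_time(num_bytes: int, latency_s: float, bandwidth_bps: float) -> float:
    """Latency plus serialization time for one message on one link.

    Assumption: no contention, bandwidth given in bits per second.
    """
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth_bps must be positive")
    return latency_s + (num_bytes * 8) / bandwidth_bps
```

With the 50μs and 1Gbps setting mentioned later in the slides, `message_transfer_time(125_000, 50e-6, 1e9)` gives roughly 1.05 ms, of which 1 ms is serialization time.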
10 Simulation Models Application model Similar to MPI trace replays Same timing and communication behaviour Certain resources not needed to scale with simulation Memory usage Advances virtual time for application between MPI calls Execute MPI calls without actually sending data (no need for buffers) Operating system noise simulation File system model Currently in development Read/Write delay, access time, congestion,
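How virtual time might advance between MPI calls can be sketched as follows. This is a toy illustration under stated assumptions, not the xsim implementation: `VirtualProcess` and `deliver` are hypothetical names, each virtual process carries its own clock, and a receive completes at the later of the receiver's own clock and the message arrival time.

```python
from dataclasses import dataclass


@dataclass
class VirtualProcess:
    rank: int
    clock: float = 0.0  # virtual time in seconds

    def compute(self, seconds: float) -> None:
        """Advance this process's virtual clock by a compute interval."""
        self.clock += seconds


def deliver(sender: VirtualProcess, receiver: VirtualProcess,
            transfer_seconds: float) -> float:
    """Model one message: the receiver cannot proceed before it arrives.

    No payload is moved; only the clocks change, mirroring the idea of
    executing MPI calls without actually sending data.
    """
    arrival = sender.clock + transfer_seconds
    receiver.clock = max(receiver.clock, arrival)
    return arrival
```

If the receiver is already past the arrival time, its clock is left unchanged, which corresponds to a receive that finds the message already waiting.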
11 Network Models Unidirectional Ring Star Tree Mesh
12 Network Models Torus Twisted Torus
13 Network Models Twisted Torus with Toroidal Jump Twisted Torus with Toroidal Degree
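For the simpler topologies listed on these slides, hop counts follow from the standard topological definitions. The Python sketch below uses those textbook definitions for a unidirectional ring, a mesh, and a plain torus; the function names are hypothetical, the slides do not say how xsim computes routes, and the twisted-torus variants are left out because their exact definitions are not given here.

```python
def ring_hops_unidirectional(src: int, dst: int, n: int) -> int:
    """Hops from src to dst when traffic flows in one direction only."""
    return (dst - src) % n


def mesh_hops(src: tuple, dst: tuple) -> int:
    """Manhattan distance between two coordinates in a mesh."""
    return sum(abs(a - b) for a, b in zip(src, dst))


def torus_hops(src: tuple, dst: tuple, dims: tuple) -> int:
    """Like a mesh, but each dimension wraps around, so take the shorter way."""
    return sum(min(abs(a - b), d - abs(a - b)) for a, b, d in zip(src, dst, dims))
```

On a 4x4 layout, going from (0, 0) to (3, 3) takes 6 hops in a mesh but only 2 in a torus, which is the effect of the wrap-around links.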
14 General Usage of xsim Add header files #include xsim-c.h #include xsim-f.h Recompile and link with library Library flag -lxsim Programming language interface flags -lxsim-c or -lxsim-f Run application in the simulator mpirun -np <real process count> <application> <application args> -xsim-np <virtual process count> <xsim args>
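The relation between the real process count (`-np`) and the virtual process count (`-xsim-np`) determines how heavily each native process is oversubscribed. The Python sketch below assumes an even block distribution of virtual ranks over native processes; that distribution and the function name are assumptions for illustration, since the slides do not describe the actual mapping.

```python
def virtual_ranks_per_native(virtual_count: int, native_count: int) -> list:
    """Return how many virtual ranks each native process would host.

    Assumption: virtual ranks are spread as evenly as possible, with the
    first (virtual_count % native_count) native processes taking one extra.
    """
    if native_count <= 0:
        raise ValueError("native_count must be positive")
    if virtual_count < 0:
        raise ValueError("virtual_count must not be negative")
    base, extra = divmod(virtual_count, native_count)
    return [base + 1 if i < extra else base for i in range(native_count)]
```

For 100,000,000 virtual processes on 936 native processes, every native process would host either 106,837 or 106,838 virtual ranks under this assumption.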
15 Examples Hello World Scaling hello world from 1000 to 100,000,000 cores Native system: 12-core 2-processor 39-node Gigabit Ethernet Simulated system: 100,000,000 processor Gigabit Ethernet xsim runs on up to 936 AMD Opteron cores and 2.5 TB RAM 468 or 936 cores needed for 100,000,000 simulated processes 100,000,000 x 8 kB = 800 GB in virtual MPI process stack (Plot legend: xsim run time on 936 cores; simulated time)
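The stack-memory figure on this slide is plain arithmetic and can be checked with a short Python sketch. The helper names are hypothetical, and decimal units (1 kB = 1000 bytes, 1 TB = 10^12 bytes) are assumed so that the result matches the 800 GB stated on the slide.

```python
def total_stack_bytes(virtual_processes: int, stack_bytes_per_process: int) -> int:
    """Total memory needed for the stacks of all virtual MPI processes."""
    return virtual_processes * stack_bytes_per_process


def fits_in_memory(required_bytes: int, available_bytes: int) -> bool:
    """True if the required stack memory fits in the available RAM."""
    return required_bytes <= available_bytes


# Values from the slide, in decimal units (an assumption).
required = total_stack_bytes(100_000_000, 8_000)  # 800 GB
available = 2_500_000_000_000                     # 2.5 TB of RAM
```

Here `required` is 800,000,000,000 bytes, which fits in the 2.5 TB of the native system but leaves the stacks as the dominant memory consumer.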
16 Examples Basic Network Model Model allows defining network architecture, latency and bandwidth Basic star network Model can be set to 0μs and Gbps as baseline 50μs and 1Gbps roughly represented the native test environment 4 Intel dual-core 2.13GHz nodes with 2GB of memory each Ubuntu bit Linux Open MPI with multi-threading support
17 Example Processor Model Model allows scaling the relative speed to a different processor Basic scaling model Model can be set to 1.0x for baseline numbers MPI hello world scales to 1M+ VPs on 4 nodes with 4GB total stack (4kB/VP) Simulation (application): Constant execution time; <1024 VPs: Noisy clock Simulator: >256 VPs: Output buffer issues (Plot legend: Simulated 0µs/ Gbps/1.0x; xsim run time 0µs/ Gbps/1.0x)
18 Example Basic PI Monte Carlo Solver Network model: Star, 50μs and 1Gbps Processor model: 1x (32kB stack/VP), 0.5x (32kB stack/VP) Simulation: Perfect scaling Simulator: <= 8 VPs: 0% overhead on the 8 processor cores; >= 4096 VPs: comm. load dominates (Plot legend: simulated time and xsim run time at 50µs/1Gbps/1.0x and 50µs/1Gbps/0.5x)
19 Examples NAS Parallel Benchmark Scaling CG and EP class B problems 1 to 128 simulated cores Native system: 4 core, 2 processor, 16 node
20 Examples NAS Parallel Benchmark CG.B Simulated time CG.B Total run time EP.B Simulated time EP.B Total run time Scaling CG and EP class A problems CG 1 to 4096 simulated cores EP 1 to cores Native system: 4 core, 2 processor, 16 node
21 Examples MCMI Core Scaling 960 Core system 240 cores for simulation due to memory bandwidth restrictions
22 Examples MCMI Problem Scaling Linear behaviour up to 2000x2000 matrix size Slight degradation for larger problem sizes
23 Examples MCMI MPI Message Count Scaling Simulator also gathers MPI statistics Linear increase of exchanged messages
25 Fault Tolerance Properties Fault tolerance is a property of a program, not of an API specification or an implementation. Within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
26 Advanced Features Resilience and Fault Tolerance xsim fully supports error handling within simulated MPI Default MPI error handlers User-defined MPI error handlers MPI_Abort() Simulated abort terminates simulation and provides Performance results Source of abort Time of abort
27 Advanced Features Resilience and Fault Tolerance Development and debugging of FT applications Permits injection of MPI process failures Propagation/detection/notification of failures within simulation Handle application-level checkpoint/restart Observation of application behaviour and performance under failure possible Support for User-Level Failure Mitigation (ULFM) extension Investigate and develop Algorithm-Based Fault Tolerance (ABFT) applications xsim is the first performance tool to support ULFM and ABFT
28 User-Level Failure Mitigation (ULFM) Fault-tolerant MPI extension Proposed by MPI 3.0 Fault Tolerance Working Group To be presented and voted upon in the MPI forum these coming months for integration in upcoming MPI 3.1 standard Minimal set of changes necessary for applications and libraries to include fault tolerance techniques and to construct more forms of fault tolerance (transactions, strongly consistent collectives, etc.)
29 User-Level Failure Mitigation (ULFM) Three main concepts Simplicity: API easy to use and understand Flexibility: API to allow for varied fault tolerant models to be built as external libraries Absence of deadlock: No MPI call (point-to-point or collective) can block indefinitely after a failure Calls must either succeed or raise an MPI error Default error handler needs to be changed to use ULFM: on at least MPI_COMM_WORLD, from MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN or a custom MPI error handler
30 User-Level Failure Mitigation (ULFM) Exceptions raised MPI_ERR_PROC_FAILED MPI_ERR_PROC_FAILED_PENDING MPI_ERR_REVOKED Acknowledge MPI_Comm_failure_ack MPI_Comm_failure_get_acked Handling MPI_Comm_shrink MPI_Comm_revoke MPI_Comm_agree MPI_Comm_iagree
31 ULFM in xsim MPI_Comm_revoke() Linear broadcast of failure notification through the simulated runtime No matching receive Releases any waited on send or receive request with MPI_ERR_PROC_FAILED MPI_ERR_PROC_FAILED_PENDING MPI_Comm_shrink() Two-phase commit protocol to establish the list of failed MPI ranks Fault-tolerant linear reduction and broadcast operations MPI_Comm_agree() Agreement on a single value (logical AND operation by live members) Fault-tolerant linear MPI_Allreduce() implementation MPI_Comm_failure_ack() and MPI_Comm_failure_get_acked() Failure registry per rank and per communicator with low memory overhead (bit arrays)
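The per-rank failure registry and the agreement operation described on this slide can be sketched in Python. This is an illustration under assumptions, not xsim's code: the class and function names are hypothetical, the registry is modeled as two bit arrays (known failures and acknowledged failures), and the agreement is shown as a bitwise AND over the integer flags contributed by live ranks only.

```python
class FailureRegistry:
    """Bit-array record of failed and acknowledged ranks for one communicator."""

    def __init__(self, size: int) -> None:
        self.size = size
        nbytes = (size + 7) // 8          # one bit per rank
        self._failed = bytearray(nbytes)
        self._acked = bytearray(nbytes)

    def mark_failed(self, rank: int) -> None:
        self._failed[rank // 8] |= 1 << (rank % 8)

    def is_failed(self, rank: int) -> bool:
        return bool(self._failed[rank // 8] & (1 << (rank % 8)))

    def failure_ack(self) -> None:
        """Acknowledge every failure known so far."""
        self._acked[:] = self._failed

    def get_acked(self) -> list:
        """Ranks whose failure has been acknowledged."""
        return [r for r in range(self.size)
                if self._acked[r // 8] & (1 << (r % 8))]


def agree(flags: dict, registry: FailureRegistry) -> int:
    """AND together the flags of live ranks; failed ranks are ignored.

    Returns -1 (all bits set) if no live rank contributed a flag.
    """
    result = -1
    for rank, flag in flags.items():
        if not registry.is_failed(rank):
            result &= flag
    return result
```

A registry for N ranks costs about N/8 bytes per bit array, which is the low memory overhead the slide attributes to bit arrays.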
35 Conclusions The Extreme-scale Simulator (xsim) is a performance investigation toolkit Uses oversubscription to model systems larger than underlying hardware Supports processor, network, application and noise models File system model under development First performance toolkit to support MPI process failure injection, checkpoint/restart and ULFM Forecast behaviour on varying systems possible Time and resource saving via simulation
More informationMethod-Level Phase Behavior in Java Workloads
Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS
More informationThe Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing
The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task
More informationCRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart
CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, Dhabaleswar K. Panda Department of Computer
More informationShort-term Memory for Self-collecting Mutators. Martin Aigner, Andreas Haas, Christoph Kirsch, Ana Sokolova Universität Salzburg
Short-term Memory for Self-collecting Mutators Martin Aigner, Andreas Haas, Christoph Kirsch, Ana Sokolova Universität Salzburg CHESS Seminar, UC Berkeley, September 2010 Heap Management explicit heap
More informationLarge Scale Debugging of Parallel Tasks with AutomaDeD!
International Conference for High Performance Computing, Networking, Storage and Analysis (SC) Seattle, Nov, 0 Large Scale Debugging of Parallel Tasks with AutomaDeD Ignacio Laguna, Saurabh Bagchi Todd
More informationHPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser
HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,
More informationDistributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello
Distributed recovery for senddeterministic HPC applications Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello 1 Fault-tolerance in HPC applications Number of cores on one CPU and
More informationThe Use of Cloud Computing Resources in an HPC Environment
The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes
More informationCapriccio : Scalable Threads for Internet Services
Capriccio : Scalable Threads for Internet Services - Ron von Behren &et al - University of California, Berkeley. Presented By: Rajesh Subbiah Background Each incoming request is dispatched to a separate
More informationVirtualization, Xen and Denali
Virtualization, Xen and Denali Susmit Shannigrahi November 9, 2011 Susmit Shannigrahi () Virtualization, Xen and Denali November 9, 2011 1 / 70 Introduction Virtualization is the technology to allow two
More informationBenchmark Generation and Simulation at Extreme Scale
Benchmark Generation and Simulation at Extreme Scale Mahesh Lagadapati Dept. of Computer Science, North Carolina State University, Raleigh, NC 27695-7534, Email: mlagada@ncsu.edu Frank Mueller Dept. of
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationMVAPICH2 vs. OpenMPI for a Clustering Algorithm
MVAPICH2 vs. OpenMPI for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland, Baltimore
More informationSami Saarinen Peter Towers. 11th ECMWF Workshop on the Use of HPC in Meteorology Slide 1
Acknowledgements: Petra Kogel Sami Saarinen Peter Towers 11th ECMWF Workshop on the Use of HPC in Meteorology Slide 1 Motivation Opteron and P690+ clusters MPI communications IFS Forecast Model IFS 4D-Var
More informationReduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection
Switching Operational modes: Store-and-forward: Each switch receives an entire packet before it forwards it onto the next switch - useful in a general purpose network (I.e. a LAN). usually, there is a
More informationExercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing
Exercises: April 11 1 PARTITIONING IN MPI COMMUNICATION AND NOISE AS HPC BOTTLENECK LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2017 Hermann Härtig THIS LECTURE Partitioning: bulk synchronous
More informationLAPI on HPS Evaluating Federation
LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of
More informationHPC Performance in the Cloud: Status and Future Prospects
HPC Performance in the Cloud: Status and Future Prospects ISC Cloud 2012 Josh Simons, Office of the CTO, VMware 2009 VMware Inc. All rights reserved Cloud Cloud computing is a model for enabling ubiquitous,
More informationContents Overview of the Compression Server White Paper... 5 Business Problem... 7
P6 Professional Compression Server White Paper for On-Premises Version 17 July 2017 Contents Overview of the Compression Server White Paper... 5 Business Problem... 7 P6 Compression Server vs. Citrix...
More information