Improving the Scalability of Comparative Debugging with MRNet
1 Improving the Scalability of Comparative Debugging with MRNet
Jin Chao
MeSsAGE Lab (Monash Uni.), Cray Inc.
David Abramson, Minh Ngoc Dinh, Jin Chao, Luiz DeRose, Robert Moench, Andrew Gontarek
2 Outline
- Assertion-based comparative debugging
- The architecture of Guard
- Improving the scalability of Guard with MRNet
- Performance evaluation
- Conclusion and future work
3 Assertion-based Comparative Debugging
4 Challenge faced by parallel debugging (1)
Cognitive challenge
- Large datasets on popular supercomputers, e.g. Cray Blue Waters: > 1.5 PB aggregate memory, > 380,000 CPU cores
- A large set of scientific data that is not human-readable: multi-dimensional, floating-point
- What is the correct state?
5 Challenge faced by parallel debugging (2)
- Control-flow based parallel debuggers
- Limitations of visualization tools
- Errors in visualized presentations are hard to detect
6 Comparative debugging
Data-centric debugging: focusing on the data set in parallel programs
Comparing state between programs
- Porting codes across different platforms
- Re-writing codes in different languages
- Software evolution: modifying/improving existing codes
Comparing state with users' expectations
- Invariants based on the properties of scientific modelling or mathematical theories
- Verifying the correctness of computing status
7 Assertion for Comparative debugging
An assertion is a statement about an intended state of a system's component.
Assertion in Guard: an ad hoc debug-time assertion
Simple assertion:
Incorrect code:
    31: for (j = 0; j < n; j++) {
    32:   for (i = 0; i < m; i++) {
    33:     p[j][j] = 5000; }}
Correct code:
    31: for (j = 0; j < n; j++) {
    32:   for (i = 0; i < m; i++) {
    33:     p[j][i] = 5000; }}
Assertion:
    assert $a::p@"source.c":33 = 5000
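To illustrate what the simple assertion on this slide checks, here is a minimal Python sketch. It is not Guard's implementation: `fill_buggy`, `fill_correct` and `check_all_equal` are hypothetical names, and the array is assumed to be square so that the buggy index `p[j][j]` stays in range.

```python
def fill_correct(n, m):
    """Mimic the correct loop: every element p[j][i] is set to 5000."""
    p = [[0] * m for _ in range(n)]
    for j in range(n):
        for i in range(m):
            p[j][i] = 5000
    return p


def fill_buggy(n):
    """Mimic the incorrect loop on an n x n array: only p[j][j] is written."""
    p = [[0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            p[j][j] = 5000
    return p


def check_all_equal(p, expected):
    """Return the (row, column) positions where p differs from expected."""
    return [(j, i)
            for j, row in enumerate(p)
            for i, value in enumerate(row)
            if value != expected]


if __name__ == "__main__":
    print(check_all_equal(fill_correct(3, 3), 5000))  # no violations
    print(check_all_equal(fill_buggy(3), 5000))       # off-diagonal positions
```

The assertion passes when the list of violations is empty; with the buggy loop, every off-diagonal element is reported.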
8 Assertion for Comparative debugging
An assertion is a statement about an intended state of a system's component.
Assertion in Guard: an ad hoc debug-time assertion
Simple assertion:
    assert $a::var@"source.c":34 = 1024
Comparative assertions: observing the divergence in the key data structures as the programs execute
    assert $a::p_array@"prog1.c":34 = $b::p_array@"prog2.c":37
Statistical assertions: verifying the statistical properties of scientific modelling or mathematical theories
- standard deviation
- histogram
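A comparative assertion can be pictured as an element-wise comparison of the same data structure taken from two programs at two breakpoints. The sketch below is only an illustration under stated assumptions: `compare_arrays` is a hypothetical helper, and the floating-point tolerance is an assumption (the slide's assertion uses plain `=`).

```python
import math


def compare_arrays(a, b, tol=1e-9):
    """Return (index, value_a, value_b) for every element that diverges.

    tol is an assumed tolerance for floating-point data; it is not taken
    from the slides.
    """
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return [(i, x, y)
            for i, (x, y) in enumerate(zip(a, b))
            if not math.isclose(x, y, rel_tol=tol, abs_tol=tol)]


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]   # e.g. p_array from the original program
    suspect = [1.0, 2.5, 3.0]     # e.g. p_array from the ported program
    print(compare_arrays(reference, suspect))
```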
9 Example of statistical assertion: histogram
Histogram: user-defined abstract data model
Creating the model with the two-phase operation:
    create $model=randset (Gaussian, , 0.05)
    set reduce histogram(1000, 0.0, 1.0)
    assert < 0.02
Estimate operator (~): χ² goodness-of-fit test
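The two-phase histogram operation on this slide can be sketched in Python as follows. This is an illustration, not Guard's code: each process bins its own values, the per-process histograms are merged by adding counts, and the merged histogram is scored against a model with a Pearson chi-square statistic, which is assumed here to be the goodness-of-fit test the slide refers to. All function names are hypothetical.

```python
def histogram(values, nbins, lo, hi):
    """Phase 1 (per process): count values falling into nbins bins on [lo, hi)."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            # min() guards against rounding pushing v into a bin past the end
            counts[min(int((v - lo) / width), nbins - 1)] += 1
    return counts


def merge_histograms(histograms):
    """Phase 2 (aggregation): add the per-process counts bin by bin."""
    return [sum(bins) for bins in zip(*histograms)]


def chi_square(observed, expected):
    """Pearson chi-square statistic; bins with zero expected count are skipped."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    per_process = [histogram([0.1, 0.2], 2, 0.0, 1.0),
                   histogram([0.6, 0.9], 2, 0.0, 1.0)]
    merged = merge_histograms(per_process)
    print(merged, chi_square(merged, [2, 2]))
```

Only the bin counts travel up the tree, so the message size depends on the number of bins rather than on the size of the data set.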
10 Implementation of assertions
An assertion is compiled into a dataflow graph
Dataflow machine
    set reduce sum
    assert $a::p_array@"prog.c":34 > 1,000,000
[Diagram] Dataflow graph nodes, in slide order: PROCSET($a), Set BKPT, Go, BKPT Hit, Get VAR, Reduce Sum(p_array), 1,000,000, Compare, Exit
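As a rough picture of how such a dataflow graph evaluates, the following Python sketch chains the last three stages: fetching the variable from every process, reducing it with a sum, and comparing the result with the constant. The function name and the in-memory lists standing in for debugger state are assumptions made for illustration only.

```python
def run_sum_assertion(per_process_arrays, threshold=1_000_000):
    """Evaluate 'sum(p_array) > threshold' over a set of processes.

    per_process_arrays stands in for the values a debugger would read from
    each process once the breakpoint is hit (the Get VAR node).
    """
    partial_sums = [sum(values) for values in per_process_arrays]  # local reduce
    total = sum(partial_sums)                                      # Reduce Sum
    return total > threshold, total                                # Compare


if __name__ == "__main__":
    ok, total = run_sum_assertion([[400_000, 300_000], [400_000]])
    print(ok, total)
```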
11 Dataflow graphs
12 The Architecture of Guard
13 Features of Guard (CCDB)
A general parallel debugger
- Supporting C, FORTRAN and MPI
A relative debugger
- Comparative assertions and statistical assertions
Client/server structure
- Machine-independent command line client
- Visual Studio client for Windows
- SUN One Studio and IBM Eclipse
Supporting servers from different architectures
- Unix: SUN (Solaris), x86 (Linux), IBM RS6000 (AIX)
- Windows: Visual Studio .NET
Architecture Independent Format (AIF)
14 The architecture of Guard
[Diagram] Front-end: CLI, relative debugger, debug client. Network: sockets. Back-end: s0, s1, s2, ..., sn, each with a GDB instance on processes p0, p1, p2, ..., pn.
15 Features of MRNet
- Tree-Based Overlay Networks (TBONs)
- Scalable broadcast and gather
- Custom data aggregation
16 General purpose API of MRNet
- User-defined tree topology. Topology file: k-ary, k-nomial and tailored layout
- Communicator: a set of back-ends
- Stream: a logical data channel over a communicator; multicast, gather and custom reduction
- Packet: collection of data
- Filter: modifies data transferred across it; WaitForAll, TimeOut, NoWait
- Startup: launching back-end processes
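The idea behind a tree-based gather with a reduction filter can be simulated in a few lines of Python. This sketch does not use the MRNet API; `tree_reduce` is a hypothetical function that applies a filter to groups of at most `fanout` children, level by level, until a single value reaches the front-end.

```python
def tree_reduce(leaf_values, fanout, filter_fn):
    """Reduce leaf values through a balanced tree with the given fan-out.

    Returns (result, levels), where levels is the number of reduction
    levels between the back-ends and the root.
    """
    level = list(leaf_values)
    if not level:
        raise ValueError("at least one back-end value is required")
    levels = 0
    while len(level) > 1:
        # Each parent node applies the filter to the packets of its children.
        level = [filter_fn(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        levels += 1
    return level[0], levels


if __name__ == "__main__":
    # Ten back-ends, fan-out 3, with a sum filter.
    print(tree_reduce(range(1, 11), 3, sum))
```

With a sum or max filter, each node forwards one packet regardless of how many children it has, which is what keeps the front-end's load bounded by the fan-out.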
17 Improving the scalability of Guard with MRNet
18 The architecture of Guard with MRNet
[Diagram] Front-end: command line interface, comparative debugger, debug client, C wrapper, MRNet front-end. MRNet tree: internal nodes (CP) running the Guard filter. Back-end: MRNet BE with s0, s1, s2, ..., sn, each with a GDB instance on processes p0, p1, p2, ..., pn.
19 Communication with MRNet
Communication patterns in Guard
- General parallel debugger
Topology
- Balanced, k-nomial
Placement of MRNet internal nodes
- Requiring no additional resources
procset: communicator
Channels for commands, events and I/O
- Command stream: synchronous channel
- Event and I/O stream: asynchronous channel
Aggregating redundant messages
- WaitForAll filter: synchronous channel
- TimeOut filter: asynchronous channel
20 Creating a communication tree of MRNet
Invoking a debug session on Cray:
[Diagram] Front-end: Guard client, MRNet FE. Back-end: aprun, apinit and a launch helper agent start processes p0, ..., pn, each with a GDB instance and s0, ..., sn.
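Aggregating redundant messages means that identical responses from many back-ends are collapsed into one message plus the set of ranks that sent it. The following Python sketch shows the idea only; it is not Guard's filter code, and the helper names and message strings are hypothetical.

```python
from collections import defaultdict


def to_ranges(ranks):
    """Compress a list of ranks into a string such as '0-2,5'."""
    ranges = []
    for r in sorted(ranks):
        if ranges and r == ranges[-1][1] + 1:
            ranges[-1][1] = r          # extend the current run
        else:
            ranges.append([r, r])      # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def aggregate_messages(messages):
    """Group (rank, text) pairs by text; return {text: compressed ranks}."""
    groups = defaultdict(list)
    for rank, text in messages:
        groups[text].append(rank)
    return {text: to_ranges(ranks) for text, ranks in groups.items()}


if __name__ == "__main__":
    replies = [(0, "bkpt hit"), (1, "bkpt hit"), (2, "bkpt hit"), (5, "exited")]
    print(aggregate_messages(replies))
```

A WaitForAll-style filter would apply this once all children have replied; a time-out filter would apply it to whatever has arrived when the timer fires.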
21 Comparative assertion with MRNet (1)
    assert $a::p_array@"prog1.c":34 = $b::p_array@"prog2.c":37
[Diagram] Front-end: Guard client with two MRNet front-ends (FE0, FE1), each above its own tree of CPs. Back-end: BE with s0, s1, ..., sm under one tree and s0, s1, ..., sn under the other.
22 Comparative assertion with MRNet (2)
Centralized comparison: reconstruct and compare at the front-end
[Diagram] Front-end: receives s0, s1, s2, s3 over MRNet. Back-end: one program with processes p0, p1 and the other with p0, p1, p2, p3, each side holding data ds0, ds1, ds2, ds3.
23 Data Reconstruction
Currently supports regular decompositions
Blockmap: assertions require the debugger to understand data decomposition
    blockmap test(p::v)
      define distribute(block, *)
      define data(1024, 1024)
    end
[Diagram labels] p_array, s_array; P1, P2, P3, P4
24 Comparative assertion with MRNet (3)
Point-to-point comparison
[Diagram] Front-end: receives comparison results r0, r1, r2, r3 over MRNet. Back-end: one program with processes p0, p1 and the other with p0, p1, p2, p3, each side holding data d0, d1, d2, d3.
25 Statistical assertion with MRNet
Standard deviation: two-phase operation
- Parallel: calculate a set of primary statistics
- Aggregation: form a full statistical model
[Diagram] Front-end: Guard client, aggregation. MRNet tree: CPs. Back-end: s0, s1, ..., sn with primary statistics ps0, ps1, ..., psn.
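For a `distribute(block, *)` decomposition, the first dimension is split into contiguous blocks of rows and the second dimension is left whole, so reconstruction amounts to stacking the per-process row blocks in rank order. The Python sketch below illustrates this under assumptions: the function names are hypothetical, and the rule for spreading leftover rows over the first ranks is an assumption, not something stated on the slide.

```python
def block_bounds(n_rows, n_procs, rank):
    """Return the [start, end) row range owned by rank.

    Assumption: when n_rows is not divisible by n_procs, the first
    (n_rows % n_procs) ranks each own one extra row.
    """
    base, extra = divmod(n_rows, n_procs)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def reconstruct(blocks_by_rank):
    """Stack per-rank row blocks in rank order to rebuild the global array."""
    rows = []
    for rank in sorted(blocks_by_rank):
        rows.extend(blocks_by_rank[rank])
    return rows


if __name__ == "__main__":
    # data(1024, 1024) over 4 processes: rank 1 owns rows 256..511.
    print(block_bounds(1024, 4, 1))
    print(reconstruct({1: [[3, 4]], 0: [[1, 2]]}))
```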
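In contrast to the centralized scheme, point-to-point comparison keeps the data at the back-ends and sends only small results upward. A minimal Python sketch, assuming the two programs' data has already been split into matching pieces (the slides do not show how pieces are paired when the process counts differ); all names are hypothetical.

```python
def local_compare(piece_a, piece_b, tol=0.0):
    """Run at a back-end: count elements of two matching pieces that differ."""
    return sum(1 for x, y in zip(piece_a, piece_b) if abs(x - y) > tol)


def point_to_point_assert(pieces_a, pieces_b, tol=0.0):
    """Compare piece by piece; only the per-piece counts reach the front-end."""
    results = [local_compare(a, b, tol) for a, b in zip(pieces_a, pieces_b)]
    return sum(results) == 0, results


if __name__ == "__main__":
    prog_a = [[1.0, 2.0], [3.0, 4.0]]
    prog_b = [[1.0, 2.0], [3.0, 4.5]]
    print(point_to_point_assert(prog_a, prog_b))
```

The traffic to the front-end is one small result per piece instead of the full arrays, which is the motivation for moving the comparison to the back-ends.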
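The two-phase standard deviation can be sketched as follows. Each back-end computes a triple of primary statistics (count, sum, sum of squares); these triples combine by addition, so they can be merged pairwise anywhere in the tree before the front-end forms the final value. The choice of this particular triple and the function names are assumptions for illustration; the slides only say that "a set of primary statistics" is computed.

```python
import math


def primary_stats(values):
    """Phase 1 (per back-end): count, sum and sum of squares."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Merge two triples; associative, so it can run at any tree node."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def stdev(stats):
    """Phase 2 (aggregation): population standard deviation from a triple."""
    n, s, ss = stats
    mean = s / n
    # max() guards against a tiny negative value caused by rounding
    return math.sqrt(max(ss / n - mean * mean, 0.0))


if __name__ == "__main__":
    parts = [[2, 4, 4, 4], [5, 5, 7, 9]]
    merged = combine(primary_stats(parts[0]), primary_stats(parts[1]))
    print(stdev(merged))
```

The sum-of-squares form is simple but can lose precision when the mean is large relative to the spread; a production implementation might prefer a more numerically stable merge.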
26 Performance Evaluation
27 Performance Evaluation
The configuration of MRNet (3.1.0)
- A balanced topology
- Fan-out: 64
Test bed: Hera, a Cray XE6 Gemini 1.2 system
- 752 computing nodes, each with 32 GB of memory
- 21,760 CPU cores in total
General parallel debugger
- SP (Scalar Pentadiagonal) of the NAS Parallel Benchmarks (NPB)
Relative debugger
- Comparative assertion: data examples from WRF
- Statistical assertion: molecular dynamics simulation
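To get a feel for the size of a balanced tree with fan-out 64, the following Python sketch counts the nodes per level. It assumes one back-end per CPU core (the deck later mentions one BE process per core) and a fully packed tree; `tree_levels` is a hypothetical helper, not part of MRNet.

```python
def tree_levels(n_backends, fanout):
    """Return node counts per level, from the back-ends up to the root."""
    levels = [n_backends]
    while levels[-1] > 1:
        # Ceiling division: each parent serves at most `fanout` children.
        levels.append(-(-levels[-1] // fanout))
    return levels


if __name__ == "__main__":
    print(tree_levels(21760, 64))
```

Under these assumptions, 21,760 back-ends need 340 first-level parents, 6 second-level parents and a single root, so a message crosses three tree levels on its way to the front-end.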
28 Scalability of Invoke Command
[Figure] Time (seconds) vs. the number of parallel processes (×10^4). Series: Total-MRNet, MRNet Instantiation, MRNet Attachment, MRNet Overall.
29 Latency of Debugging Commands
[Figure] Time (seconds) vs. the number of parallel processes. Series: All bkpt (MRNet), All step (MRNet).
30 Message Aggregation
A memory buffer of 256 bytes was added to the SP program. Its value was inspected under different degrees of aggregation (DoA).
[Figure] Time (seconds) vs. the number of parallel processes. Series: DoA = 1 (MRNet), DoA = 0.1 (MRNet), DoA = 0.01 (MRNet).
31 Comparative assertion
320 KBytes × 10,000 ≈ 3 GB. The data size is taken from WRF, a production climate model.
32 Statistical assertion
Molecular dynamics simulation: 209 × 10^6 atoms
The atomistic mechanism of fracture accompanying structural phase transformation in AlN ceramic under hypervelocity impact
[Figure] Time (seconds) vs. the number of parallel processes. Series: Overall-Histogram, Reduction-Histogram, Overall-Stdev, Reduction-Stdev.
33 Conclusion and Future Work
MRNet improves the scalability of Guard
- > 20,000 debug servers
The overhead of creating an MRNet tree
- Attachment:
- Instantiation: one BE process per core
How to take advantage of the computing capability of MRNet
- Comparison across multiple trees?
- Building a tree across multiple aprun?
- Programming filters for handling aggregations of statistical assertions
The best way of maintaining a C wrapper for the C++ interface?
34 Related publications
- Dinh M.N., Abramson D., Jin C., Gontarek A., Moench R., and DeRose L., Debugging Scientific Applications With Statistical Assertion, in International Conference on Computational Science (ICCS), Omaha, USA.
- Dinh M.N., Abramson D., Jin C., Gontarek A., Moench R., and DeRose L., Scalable parallel debugging with statistical assertions, ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP) 2012, New Orleans, USA. (Poster session)
- Jin C., Abramson D., Dinh M.N., Gontarek A., Moench R., and DeRose L., A Scalable Parallel Debugging Library with Pluggable Communication Protocols, the 12th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2012), Ottawa, Canada, May.
- Dinh M.N., Abramson D., Kurniawan D., Jin C., Moench R., and DeRose L., Assertion based parallel debugging, the 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2011), Newport Beach, USA, May.
- Abramson D., Dinh M.N., Kurniawan D., Moench B., and DeRose L., Data Centric Highly Parallel Debugging, 2010 International Symposium on High Performance Distributed Computing (HPDC 2010), Chicago, USA, June.
- Abramson D., Foster I., Michalakes J., and Sosic R., Relative Debugging and its Application to the Development of Large Numerical Models, Proceedings of IEEE Supercomputing 1995, San Diego, December 1995. (Best paper award)
35 Thanks and Questions?
More informationParallel Programming. Presentation to Linux Users of Victoria, Inc. November 4th, 2015
Parallel Programming Presentation to Linux Users of Victoria, Inc. November 4th, 2015 http://levlafayette.com 1.0 What Is Parallel Programming? 1.1 Historically, software has been written for serial computation
More informationUsing Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh
Using Java for Scientific Computing Mark Bul EPCC, University of Edinburgh markb@epcc.ed.ac.uk Java and Scientific Computing? Benefits of Java for Scientific Computing Portability Network centricity Software
More informationA Lightweight Library for Building Scalable Tools
A Lightweight Library for Building Scalable Tools Emily R. Jacobson, Michael J. Brim, and Barton P. Miller Computer Sciences Department, University of Wisconsin Madison, Wisconsin, USA {jacobson,mjbrim,bart}@cs.wisc.edu
More informationWhatÕs New in the Message-Passing Toolkit
WhatÕs New in the Message-Passing Toolkit Karl Feind, Message-passing Toolkit Engineering Team, SGI ABSTRACT: SGI message-passing software has been enhanced in the past year to support larger Origin 2
More informationTuning I/O Performance for Data Intensive Computing. Nicholas J. Wright. lbl.gov
Tuning I/O Performance for Data Intensive Computing. Nicholas J. Wright njwright @ lbl.gov NERSC- National Energy Research Scientific Computing Center Mission: Accelerate the pace of scientific discovery
More informationIntel MPI Cluster Edition on Graham A First Look! Doug Roberts
Intel MPI Cluster Edition on Graham A First Look! Doug Roberts SHARCNET / COMPUTE CANADA Intel Parallel Studio XE 2016 Update 4 Cluster Edition for Linux 1. Intel(R) MPI Library 5.1 Update 3 Cluster Ed
More informationSplotch: High Performance Visualization using MPI, OpenMP and CUDA
Splotch: High Performance Visualization using MPI, OpenMP and CUDA Klaus Dolag (Munich University Observatory) Martin Reinecke (MPA, Garching) Claudio Gheller (CSCS, Switzerland), Marzia Rivi (CINECA,
More informationHigh Throughput WAN Data Transfer with Hadoop-based Storage
High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San
More informationDelivers cost savings, high definition display, and supercharged sharing
TM OpenText TM Exceed TurboX Delivers cost savings, high definition display, and supercharged sharing OpenText Exceed TurboX is an advanced solution for desktop virtualization and remote access to enterprise
More informationIntra-MIC MPI Communication using MVAPICH2: Early Experience
Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationLecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H Katz Computer Science 252 Spring 996 RHKS96 Review: Storage System Issues Historical Context of Storage I/O Storage I/O Performance
More informationBisection Debugging. 1 Introduction. Thomas Gross. Carnegie Mellon University. Preliminary version
Bisection Debugging Thomas Gross School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Institut für Computer Systeme ETH Zürich CH 8092 Zürich Preliminary version Abstract This paper
More informationPower Aware Hierarchical Epidemics in P2P Systems Emrah Çem, Tuğba Koç, Öznur Özkasap Koç University, İstanbul
Power Aware Hierarchical Epidemics in P2P Systems Emrah Çem, Tuğba Koç, Öznur Özkasap Koç University, İstanbul COST Action IC0804 Workshop in Budapest - Working Group 3 May 19th 2011 supported by TUBITAK
More information