HPCToolkit: Performance Tools for Parallel Scientific Codes

Size: px
Start display at page:

Download "HPCToolkit: Performance Tools for Parallel Scientific Codes"

Transcription

1 HPCToolkit: Performance Tools for Parallel Scientific Codes John Mellor-Crummey Department of Computer Science Rice University Building Community Codes for Effective Scientific Research on HPC Platforms September 7,

2 Challenges for Computational Scientists Execution environments and applications are rapidly evolving architecture rapidly changing multicore microprocessor designs increasing scale of parallel systems growing use of accelerators applications transition from MPI everywhere to threaded implementations add additional scientific capabilities maintain multiple variants or configurations Computational scientists need to assess weaknesses in algorithms and their implementations improve scalability of executions within and across nodes adapt to changes in emerging architectures Performance tools can play an important role as a guide 2

3 Performance Analysis Challenges Complex architectures are hard to use efficiently multi-level parallelism: multi-core, ILP, SIMD instructions multi-level memory hierarchy result: gap between typical and peak performance is huge Complex applications present challenges measurement and analysis understanding behaviors and tuning performance Supercomputer platforms compound the complexity unique hardware unique microkernel-based operating systems multifaceted performance concerns computation data movement communication I/O 3

4 What Users Want Multi-platform, programming model independent tools Accurate measurement of complex parallel codes large, multi-lingual programs fully optimized code: loop optimization, templates, inlining binary-only libraries, sometimes partially stripped complex execution environments dynamic loading, static linking SPMD parallel codes with threaded node programs batch jobs Effective performance analysis insightful analysis that pinpoints and explains problems correlate measurements with code for actionable results support analysis at the desired level intuitive enough for application scientists and engineers detailed enough for library developers and compiler writers Scalable to petascale and beyond 4

5 We Build It * HPCToolkit - 160K lines, 797 files measurement, data analysis: 110K lines C/C++, scripts; 424 files hpcviewer, hpctraceviewer GUIs: 54K lines Java; 373 files HPCToolkit externals - 2.5M lines C/C++, 5782 files components developed execution control: libmonitor - 7K lines, 35 files binary analysis: Open Analysis - 76K lines, 343 files (+ ANL, Colorado) components extensively modified binary analysis: GNU binutils M total, 1650 files; (448K bfd) other components stack unwinding: libunwind XML: libxml2, xerces understanding binaries: libelf, libdwarf, symtabapi * With support from the US government DOE Office of Science: DE-FC02-07ER25800, DE-FC02-06ER25762 LANL: G, , , AFRL: FA C

6 Contributors Current staff: Michael Fagan, Mark Krentel, Laksono Adhianto students: Xu Liu, Milind Chabbi, Karthik Murthy external: Nathan Tallent (PNNL) Alumni students: Gabriel Marin (ORNL), Nathan Froyd (Mozilla) staff: Rob Fowler (UNC) interns: Sinchan Banerjee (MIT), Michael Franco (Rice), Reed Landrum (Stanford), Bowden Kelly (Georgia Tech), Philip Taffet (St. John s High School) 6

7 HPCToolkit Approach Employ binary-level measurement and analysis observe fully optimized, dynamically linked executions support multi-lingual codes with external binary-only libraries Use sampling-based measurement (avoid instrumentation) controllable overhead minimize systematic error and avoid blind spots enable data collection for large-scale parallelism Collect and correlate multiple derived performance metrics diagnosis typically requires more than one species of metric Associate metrics with both static and dynamic context loop nests, procedures, inlined code, calling context Support top-down performance analysis natural approach that minimizes burden on developers 7

8 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 8

9 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 9

10 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure For dynamically-linked executables on stock Linux compile and link as you usually do For statically-linked executables (e.g. for Blue Gene, Cray) add monitoring by using hpclink as prefix to your link line presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 10

11 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure Measure execution unobtrusively launch optimized application binaries dynamically-linked applications: launch with hpcrun e.g., mpirun -np 8192 hpcrun -t -e flash3... statically-linked applications: control with environment variables collect statistical call path profiles of events of interest presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 11

12 Call Path Profiling Measure and attribute costs in context sample timer or hardware counter overflows gather calling context using stack unwinding Call path sample return address Calling context tree return address return address instruction pointer Overhead proportional to sampling frequency......not call frequency 12

13 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure Analyze binary with hpcstruct: recover program structure analyze machine code, line map, debugging information extract loop nesting & identify inlined procedures map transformed loops and procedures to source presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 13

14 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis [hpcstruct] call path profile program structure Combine multiple profiles multiple threads; multiple processes; multiple executions Correlate metrics to static & dynamic program structure presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 14

15 HPCToolkit Workflow source code compile & link optimized binary profile execution [hpcrun] binary analysis call path profile program structure [hpcstruct] Presentation explore performance data from multiple perspectives rank order by metrics to focus on what s important compute derived metrics to help gain insight e.g. scalability losses, waste, CPI, bandwidth graph thread-level metrics for contexts explore evolution of behavior over time presentation [hpcviewer/ hpctraceviewer] database interpret profile correlate w/ source [hpcprof/hpcprof-mpi] 15

16 Analyzing with hpcviewer source pane metric display view control costs for inlined procedures loops function calls in full context navigation pane metric pane 16

17 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 17

18 The Problem of Scaling ? Efficiency Ideal efficiency Actual efficiency CPUs Note: higher is better 18

19 Wanted: Scalability Analysis Isolate scalability bottlenecks Guide user to problems Quantify the magnitude of each problem 19

20 Challenges for Pinpointing Scalability Bottlenecks Parallel applications modern software uses layers of libraries performance is often context dependent Example climate code skeleton main land sea ice ocean atmosphere wait wait wait wait Monitoring bottleneck nature: computation, data movement, synchronization? 2 pragmatic constraints acceptable data volume low perturbation for use in production runs 20

21 Performance Analysis with Expectations You have performance expectations for your parallel code strong scaling: linear speedup weak scaling: constant execution time Put your expectations to work measure performance under different conditions e.g. different levels of parallelism or different inputs express your expectations as an equation compute the deviation from expectations for each calling context for both inclusive and exclusive costs correlate the metrics with the source code explore the annotated call tree interactively 21

22 Pinpointing and Quantifying Scalability Bottlenecks P 600K Q 400K = P Q coefficients for analysis of strong scaling 200K 22

23 Scalability Analysis Demo: FLASH3 Code: University of Chicago FLASH3 Simulation: Platform: Experiment: 8192 vs. 256 processors Scaling type: weak Parallel, adaptive-mesh refinement (AMR) code Block structured AMR; a block is the unit of computation Designed for white compressible dwarf reactive detonation flows Can solve a broad range of (astro)physical problems Portable: runs Blue on many Gene/P massively-parallel systems Scales and performs well Fully modular and extensible: components can be combined to create many different applications Nova outbursts on white dwarfs Laser-driven shock instabilities Magnetic Rayleigh-Taylor Cellular detonation Helium burning on neutron stars Figures courtesy of FLASH Team, University of Chicago Orzag/Tang MHD vortex Rayleigh-Taylor instability 23

24 Improved Flash Scaling of AMR Setup Graph courtesy of Anshu Dubey, U Chicago 24

25 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 25

26 Understanding Temporal Behavior Profiling compresses out the temporal dimension temporal patterns, e.g. serialization, are invisible in profiles What can we do? Trace call path samples sketch: N times per second, take a call path sample of each thread organize the samples for each thread along a time line view how the execution evolves left to right what do we view? assign each procedure a color; view a depth slice of an execution Processes Time Call stack 26

27 hpctraceviewer: detail of Load imbalance among threads appears as different lengths of colored bands along the x axis 27

28 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 28

29 960 cores, radix sort Two views of load imbalance since not on a 2 k cores 29

30 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 30

31 Blame Shifting Problem: in many circumstances sampling measures symptoms of performance losses rather than causes worker threads waiting for work threads waiting for a lock MPI process waiting for peers in a collective communication idle GPU waiting for work Approach: shift blame for losses from victims to perpetrators blame code executing while other threads are idle blame code executed by lock holder when thread(s) are waiting blame processes that arrive late to collectives shift blame between CPU and GPU for hybrid code 31

32 Performance Expectations for Hybrid Code with Blame Shifting 32

33 Performance Expectations for Hybrid Code with Blame Shifting 33

34 Performance Expectations for Hybrid Code with Blame Shifting 34

35 Performance Expectations for Hybrid Code with Blame Shifting 35

36 GPU Successes with HPCToolkit LAMMPS: identified hardware problem with Keeneland system improperly seated GPUs were observed to have lower data copy bandwidth LULESH: identified the dynamic memory allocation using cudamalloc and cudafree accounted for 90% of the idleness of the GPU 36

37 Data Centric Analysis Goal: associate memory hierarchy performance losses with data Approach intercept allocations to associate with their data ranges measure latency with instruction-based sampling (AMD Opteron) present quantitative results using hpcviewer 37

38 Data Centric Analysis of S3D yspecies latency for this loop is 14.5% of total latency in program 41.2% of memory hierarchy latency related to yspecies array 38

39 Outline Overview of Rice s HPCToolkit Pinpointing scalability bottlenecks scalability bottlenecks on large-scale parallel systems scaling on multicore processors Understanding temporal behavior Assessing process variability Understanding threading, GPU, and memory hierarchy blame shifting attributing memory hierarchy costs to data Summary and conclusions 39

40 Summary Sampling provides low overhead measurement Call path profiling + binary analysis + blame shifting = insight scalability bottlenecks where insufficient parallelism lurks sources of lock contention load imbalance temporal dynamics bottlenecks in hybrid code problematic data structures Other capabilities attribute memory leaks back to their full calling context 40

41 Current Technical Issues Keeping up with emerging platforms... Blue Waters (Cray XK6) AMD Interlagos introduces new vector instructions issue: binary analysis tools need update (even gdb is ignorant at present!) NVIDIA Tesla K20 GPUs issue: monitoring and assessing impact of asynchronous activities OpenACC programming model issue: creates threads that synchronize before main Stampede (Intel MIC) new instruction set for MIC new programming models MIC only programming MIC as MPI rank (ranks with heterogeneous capabilities) offload model massive threading Keeneland (NVIDIA Tesla K20 GPU) multi-socket, multi-gpu nodes 41

42 A Spectrum of Tool Challenges Intellectual research - develop new techniques for measurement, analysis, and presentation challenges of emerging systems increasing scale of systems (e.g. Sequoia) heterogeneity (e.g., host + accelerator (MIC or GPU); AMD Fusion) exploding growth of threading: MIC supports 200+ threads blame shifting: identify causes rather than symptoms analyzing asynchronous activities measure and analyze all facets of application performance CPU, accelerator, intra- and inter-node data move & sync, I/O, interaction with HW, interaction with other jobs, interaction with system software provide higher level insight, diagnosis, and guidance Community leadership and engagement OS support for tools past successes: BG/P CNK and Cray CNL kernels; BG/Q spec; Linux kernel; PMU device drivers today: interfaces for observing communication, I/O network issues, data access latency standards committees: today - OpenMP tools API; future - OpenCL tools API? vendor engagement: NVIDIA, Intel, IBM outreach with workshops Software development: new instructions for new CPUs; new programming models for heterogeneous maintenance deployment & user support 42

43 A Community of End Users Government laboratories DOE Office of Science Labs Argonne National Laboratory Lawrence Berkeley National Laboratory Oak Ridge National Laboratory National Renewable Energy Laboratory DOE NNSA Labs Sandia National Laboratory Los Alamos National Laboratory Lawrence Livermore National Laboratory DOD Engineering Research Development Center TACC 25 PerfExpert Sites Numerous universities University of Utah University of Southern California Ohio State UNC Chapel Hill Mississippi State Georgia Tech... Companies Bull Cray Computer SAS Institute Western Geco AMD Numerical Algorithms Group Shell Foreign centers Juelich Supercomputing Centre Britain s Atomic Weapons Establishment Australian National University Laboratoire d'aérologie, O.M.P. INRIA Norwegian Metacenter for High Performance Computing 43

44 A Community: One Group s Tools are another group s infrastructure PerfExpert (University of Texas and Texas State) uses HPCToolkit s measurement and analysis infrastructure to collect hardware performance counter data, analyze application binaries, and attribute performance to routines and loop nests bullx (Bull) HPCToolkit performance tools are part of the standard bullx software suite deployed on Bull s supercomputers custom extensions to hpcviewer to support cluster analysis MIAMI (ORNL) uses HPCToolkit s hpcviewer user interface for interactive analysis of performance modeling results OpenSpeedshop (Krell) uses HPCToolkit s libmonitor for Scott Pakin (LANL) uses HPCToolkit s hpcviewer user interface for interactive analysis of performance measurement data collected with custom instrumentation 44

45 Software Engineering? Ideally: design abstractions and implement them Harsh reality: performance tools interact with the system at the lowest level, which introduces complexity and design compromises absence of proper interfaces and mechanisms in layers we interact with blocking system calls that are invisible to timer-based profiling, system calls that can t be restarted when interrupted stripped code (e.g., runtimes for OpenMP, CUDA) wrong information from compilers libraries (e.g., libc) that lack proper interfaces for wrapping helper threads launched during initialization of dynamic libraries lack of standard calling conventions in low-level code (e.g., dynamic library loading)... experimentally determine what s possible and how things work e.g., how can I get notification when CUDA kernels complete without serializing multiple threads that share a GPU only a stripped library from NVIDIA; no behavioral documentation grow a piece of infrastructure from a prototype With a larger team, we d re-implement some components from scratch with the benefit of hindsight Many packages we depend upon leave much to be desired but there is insufficient funding to rewrite them 45

46 Open Issues Incentives are lacking for community performance tools rational model: build reusable building tool blocks reality funding for research, not development components help our competitors more than us Ongoing funding for maintenance, deployment, user support? Continuity? 46

47 HPCToolkit Capabilities at a Glance Attribute Costs to Code Analyze Behavior over Time Pinpoint & Quantify Scaling Bottlenecks Shift Blame from Symptoms to Causes Assess Imbalance and Variability Associate Costs with Data hpctoolkit.org

HPCToolkit: Sampling-based Performance Tools for Leadership Computing

HPCToolkit: Sampling-based Performance Tools for Leadership Computing HPCToolkit: Sampling-based Performance Tools for Leadership Computing John Mellor-Crummey Department of Computer Science Rice University johnmc@cs.rice.edu http://hpctoolkit.org CScADS 2009 Workshop on

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles

Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles Nathan Tallent Laksono Adhianto, John Mellor-Crummey Mike Fagan, Mark Krentel Rice University CScADS Workshop August

More information

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University Scalable Tools Workshop 7 August 2017

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University   Scalable Tools Workshop 7 August 2017 Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University http://hpctoolkit.org Scalable Tools Workshop 7 August 2017 HPCToolkit 1 HPCToolkit Workflow source code compile &

More information

Critically Missing Pieces on Accelerators: A Performance Tools Perspective

Critically Missing Pieces on Accelerators: A Performance Tools Perspective Critically Missing Pieces on Accelerators: A Performance Tools Perspective, Karthik Murthy, Mike Fagan, and John Mellor-Crummey Rice University SC 2013 Denver, CO November 20, 2013 What Is Missing in GPUs?

More information

Sampling-based Strategies for Measurement and Analysis

Sampling-based Strategies for Measurement and Analysis Center for Scalable Application Development Software Sampling-based Strategies for Measurement and Analysis John Mellor-Crummey Department of Computer Science Rice University CScADS Petascale Performance

More information

HPCToolkit User s Manual

HPCToolkit User s Manual HPCToolkit User s Manual John Mellor-Crummey Laksono Adhianto, Mike Fagan, Mark Krentel, Nathan Tallent Rice University October 9, 2017 Contents 1 Introduction 1 2 HPCToolkit Overview 5 2.1 Asynchronous

More information

Compilers and Compiler-based Tools for HPC

Compilers and Compiler-based Tools for HPC Compilers and Compiler-based Tools for HPC John Mellor-Crummey Department of Computer Science Rice University http://lacsi.rice.edu/review/2004/slides/compilers-tools.pdf High Performance Computing Algorithms

More information

Effective Performance Measurement and Analysis of Multithreaded Applications

Effective Performance Measurement and Analysis of Multithreaded Applications Effective Performance Measurement and Analysis of Multithreaded Applications Nathan Tallent John Mellor-Crummey Rice University CSCaDS hpctoolkit.org Wanted: Multicore Programming Models Simple well-defined

More information

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2

More information

Debugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.

Debugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc. Debugging CUDA Applications with Allinea DDT Ian Lumb Sr. Systems Engineer, Allinea Software Inc. ilumb@allinea.com GTC 2013, San Jose, March 20, 2013 Embracing GPUs GPUs a rival to traditional processors

More information

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid

More information

The Titan Tools Experience

The Titan Tools Experience The Titan Tools Experience Michael J. Brim, Ph.D. Computer Science Research, CSMD/NCCS Petascale Tools Workshop 213 Madison, WI July 15, 213 Overview of Titan Cray XK7 18,688+ compute nodes 16-core AMD

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

IDENTIFYING PERFORMANCE BOTTLENECKS IN WORK-STEALING COMPUTATIONS

IDENTIFYING PERFORMANCE BOTTLENECKS IN WORK-STEALING COMPUTATIONS C OV ER F E AT U RE IDENTIFYING PERFORMANCE BOTTLENECKS IN WORK-STEALING COMPUTATIONS Nathan R. Tallent and John M. Mellor-Crummey, Rice University Work stealing is an effective load-balancing strategy

More information

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software GPU Debugging Made Easy David Lecomber CTO, Allinea Software david@allinea.com Allinea Software HPC development tools company Leading in HPC software tools market Wide customer base Blue-chip engineering,

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

Application / User Viewpoint

Application / User Viewpoint SC 07 Reno, Nevada November 15, 2007 Fortran@50 Application / User Viewpoint Henry Tufo Computer Science Section Head Associate Professor and Director, Computational Science Center Department of Computer

More information

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC Course Class #1 Q&A Contents OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC/CUDA/OpenMP Q: Is OpenACC an NVIDIA standard or is it accepted

More information

AutoTune Workshop. Michael Gerndt Technische Universität München

AutoTune Workshop. Michael Gerndt Technische Universität München AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy

More information

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 Cray: Leadership in Computational Research Earth Sciences

More information

ScalaIOTrace: Scalable I/O Tracing and Analysis

ScalaIOTrace: Scalable I/O Tracing and Analysis ScalaIOTrace: Scalable I/O Tracing and Analysis Karthik Vijayakumar 1, Frank Mueller 1, Xiaosong Ma 1,2, Philip C. Roth 2 1 Department of Computer Science, NCSU 2 Computer Science and Mathematics Division,

More information

Hardware Performance Monitoring Unit Working Group Outbrief

Hardware Performance Monitoring Unit Working Group Outbrief Hardware Performance Monitoring Unit Working Group Outbrief CScADS Performance Tools for Extreme Scale Computing August 2011 hpctoolkit.org Topics From HW-centric measurements to application understanding

More information

Pinpointing Data Locality Problems Using Data-centric Analysis

Pinpointing Data Locality Problems Using Data-centric Analysis Center for Scalable Application Development Software Pinpointing Data Locality Problems Using Data-centric Analysis Xu Liu XL10@rice.edu Department of Computer Science Rice University Outline Introduction

More information

Operational Robustness of Accelerator Aware MPI

Operational Robustness of Accelerator Aware MPI Operational Robustness of Accelerator Aware MPI Sadaf Alam Swiss National Supercomputing Centre (CSSC) Switzerland 2nd Annual MVAPICH User Group (MUG) Meeting, 2014 Computing Systems @ CSCS http://www.cscs.ch/computers

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Center for Scalable Application Development Software: Application Engagement. Ewing Lusk (ANL) Gabriel Marin (Rice)

Center for Scalable Application Development Software: Application Engagement. Ewing Lusk (ANL) Gabriel Marin (Rice) Center for Scalable Application Development Software: Application Engagement Ewing Lusk (ANL) Gabriel Marin (Rice) CScADS Midterm Review April 22, 2009 1 Application Engagement Workshops (2 out of 4) for

More information

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015 PERFORMANCE PORTABILITY WITH OPENACC Jeff Larkin, NVIDIA, November 2015 TWO TYPES OF PORTABILITY FUNCTIONAL PORTABILITY PERFORMANCE PORTABILITY The ability for a single code to run anywhere. The ability

More information

Scalability Improvements in the TAU Performance System for Extreme Scale

Scalability Improvements in the TAU Performance System for Extreme Scale Scalability Improvements in the TAU Performance System for Extreme Scale Sameer Shende Director, Performance Research Laboratory, University of Oregon TGCC, CEA / DAM Île de France Bruyères- le- Châtel,

More information

A Software Developing Environment for Earth System Modeling. Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012

A Software Developing Environment for Earth System Modeling. Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012 A Software Developing Environment for Earth System Modeling Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012 1 Outline Motivation Purpose and Significance Research Contents Technology

More information

Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI

Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI Gabriel Marin February 11, 2014 MPAS-Ocean [4] is a component of the MPAS framework of climate models. MPAS-Ocean is an unstructured-mesh

More information

IBM High Performance Computing Toolkit

IBM High Performance Computing Toolkit IBM High Performance Computing Toolkit Pidad D'Souza (pidsouza@in.ibm.com) IBM, India Software Labs Top 500 : Application areas (November 2011) Systems Performance Source : http://www.top500.org/charts/list/34/apparea

More information

Allinea Unified Environment

Allinea Unified Environment Allinea Unified Environment Allinea s unified tools for debugging and profiling HPC Codes Beau Paisley Allinea Software bpaisley@allinea.com 720.583.0380 Today s Challenge Q: What is the impact of current

More information

ELP. Effektive Laufzeitunterstützung für zukünftige Programmierstandards. Speaker: Tim Cramer, RWTH Aachen University

ELP. Effektive Laufzeitunterstützung für zukünftige Programmierstandards. Speaker: Tim Cramer, RWTH Aachen University ELP Effektive Laufzeitunterstützung für zukünftige Programmierstandards Agenda ELP Project Goals ELP Achievements Remaining Steps ELP Project Goals Goals of ELP: Improve programmer productivity By influencing

More information

Parallelism V. HPC Profiling. John Cavazos. Dept of Computer & Information Sciences University of Delaware

Parallelism V. HPC Profiling. John Cavazos. Dept of Computer & Information Sciences University of Delaware Parallelism V HPC Profiling John Cavazos Dept of Computer & Information Sciences University of Delaware Lecture Overview Performance Counters Profiling PAPI TAU HPCToolkit PerfExpert Performance Counters

More information

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION

More information

HPC future trends from a science perspective

HPC future trends from a science perspective HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively

More information

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid

More information

Open SpeedShop Capabilities and Internal Structure: Current to Petascale. CScADS Workshop, July 16-20, 2007 Jim Galarowicz, Krell Institute

Open SpeedShop Capabilities and Internal Structure: Current to Petascale. CScADS Workshop, July 16-20, 2007 Jim Galarowicz, Krell Institute 1 Open Source Performance Analysis for Large Scale Systems Open SpeedShop Capabilities and Internal Structure: Current to Petascale CScADS Workshop, July 16-20, 2007 Jim Galarowicz, Krell Institute 2 Trademark

More information

Scalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany

Scalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP

More information

Understanding Executables Working Group

Understanding Executables Working Group Center for Scalable Application Development Software Understanding Executables Working Group CScADS Petascale Performance Tools Workshop, July 2007 1 Working Group Members Drew Bernat Robert Cohn Mike

More information

The Cray Rainier System: Integrated Scalar/Vector Computing

The Cray Rainier System: Integrated Scalar/Vector Computing THE SUPERCOMPUTER COMPANY The Cray Rainier System: Integrated Scalar/Vector Computing Per Nyberg 11 th ECMWF Workshop on HPC in Meteorology Topics Current Product Overview Cray Technology Strengths Rainier

More information

Parallel Applications on Distributed Memory Systems. Le Yan HPC User LSU

Parallel Applications on Distributed Memory Systems. Le Yan HPC User LSU Parallel Applications on Distributed Memory Systems Le Yan HPC User Services @ LSU Outline Distributed memory systems Message Passing Interface (MPI) Parallel applications 6/3/2015 LONI Parallel Programming

More information

Improving Applica/on Performance Using the TAU Performance System

Improving Applica/on Performance Using the TAU Performance System Improving Applica/on Performance Using the TAU Performance System Sameer Shende, John C. Linford {sameer, jlinford}@paratools.com ParaTools, Inc and University of Oregon. April 4-5, 2013, CG1, NCAR, UCAR

More information

Cray Performance Tools Enhancements for Next Generation Systems Heidi Poxon

Cray Performance Tools Enhancements for Next Generation Systems Heidi Poxon Cray Performance Tools Enhancements for Next Generation Systems Heidi Poxon Agenda Cray Performance Tools Overview Recent Enhancements Support for Cray systems with KNL 2 Cray Performance Analysis Tools

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Preliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede

Preliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede Preliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede Qingyu Meng, Alan Humphrey, John Schmidt, Martin Berzins Thanks to: TACC Team for early access to Stampede J. Davison

More information

Early Experiences Writing Performance Portable OpenMP 4 Codes

Early Experiences Writing Performance Portable OpenMP 4 Codes Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic

More information

Parallel Programming with MPI on Clusters

Parallel Programming with MPI on Clusters Parallel Programming with MPI on Clusters Rusty Lusk Mathematics and Computer Science Division Argonne National Laboratory (The rest of our group: Bill Gropp, Rob Ross, David Ashton, Brian Toonen, Anthony

More information

Performance Analysis of Parallel Scientific Applications In Eclipse

Performance Analysis of Parallel Scientific Applications In Eclipse Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System

Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins, Todd Harman Scientific Computing and Imaging Institute & University of Utah I. Uintah

More information

Introduction to Parallel Performance Engineering

Introduction to Parallel Performance Engineering Introduction to Parallel Performance Engineering Markus Geimer, Brian Wylie Jülich Supercomputing Centre (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray) Performance:

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

Software and Tools for HPE s The Machine Project

Software and Tools for HPE s The Machine Project Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric

More information

Using Intel VTune Amplifier XE and Inspector XE in.net environment

Using Intel VTune Amplifier XE and Inspector XE in.net environment Using Intel VTune Amplifier XE and Inspector XE in.net environment Levent Akyil Technical Computing, Analyzers and Runtime Software and Services group 1 Refresher - Intel VTune Amplifier XE Intel Inspector

More information

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers Overlapping Computation and Communication for Advection on Hybrid Parallel Computers James B White III (Trey) trey@ucar.edu National Center for Atmospheric Research Jack Dongarra dongarra@eecs.utk.edu

More information

Oracle Developer Studio Performance Analyzer

Oracle Developer Studio Performance Analyzer Oracle Developer Studio Performance Analyzer The Oracle Developer Studio Performance Analyzer provides unparalleled insight into the behavior of your application, allowing you to identify bottlenecks and

More information

Illinois Proposal Considerations Greg Bauer

Illinois Proposal Considerations Greg Bauer - 2016 Greg Bauer Support model Blue Waters provides traditional Partner Consulting as part of its User Services. Standard service requests for assistance with porting, debugging, allocation issues, and

More information

Christopher Sewell Katrin Heitmann Li-ta Lo Salman Habib James Ahrens

Christopher Sewell Katrin Heitmann Li-ta Lo Salman Habib James Ahrens LA-UR- 14-25437 Approved for public release; distribution is unlimited. Title: Portable Parallel Halo and Center Finders for HACC Author(s): Christopher Sewell Katrin Heitmann Li-ta Lo Salman Habib James

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Damaris. In-Situ Data Analysis and Visualization for Large-Scale HPC Simulations. KerData Team. Inria Rennes,

Damaris. In-Situ Data Analysis and Visualization for Large-Scale HPC Simulations. KerData Team. Inria Rennes, Damaris In-Situ Data Analysis and Visualization for Large-Scale HPC Simulations KerData Team Inria Rennes, http://damaris.gforge.inria.fr Outline 1. From I/O to in-situ visualization 2. Damaris approach

More information

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber NERSC Site Update National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory Richard Gerber NERSC Senior Science Advisor High Performance Computing Department Head Cori

More information

Scalable Debugging with TotalView on Blue Gene. John DelSignore, CTO TotalView Technologies

Scalable Debugging with TotalView on Blue Gene. John DelSignore, CTO TotalView Technologies Scalable Debugging with TotalView on Blue Gene John DelSignore, CTO TotalView Technologies Agenda TotalView on Blue Gene A little history Current status Recent TotalView improvements ReplayEngine (reverse

More information

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D.

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D. Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

Scalable, Automated Parallel Performance Analysis with TAU, PerfDMF and PerfExplorer

Scalable, Automated Parallel Performance Analysis with TAU, PerfDMF and PerfExplorer Scalable, Automated Parallel Performance Analysis with TAU, PerfDMF and PerfExplorer Kevin A. Huck, Allen D. Malony, Sameer Shende, Alan Morris khuck, malony, sameer, amorris@cs.uoregon.edu http://www.cs.uoregon.edu/research/tau

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

An Introduction to the SPEC High Performance Group and their Benchmark Suites

An Introduction to the SPEC High Performance Group and their Benchmark Suites An Introduction to the SPEC High Performance Group and their Benchmark Suites Robert Henschel Manager, Scientific Applications and Performance Tuning Secretary, SPEC High Performance Group Research Technologies

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

Transport Simulations beyond Petascale. Jing Fu (ANL)

Transport Simulations beyond Petascale. Jing Fu (ANL) Transport Simulations beyond Petascale Jing Fu (ANL) A) Project Overview The project: Peta- and exascale algorithms and software development (petascalable codes: Nek5000, NekCEM, NekLBM) Science goals:

More information

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

Oak Ridge National Laboratory Computing and Computational Sciences

Oak Ridge National Laboratory Computing and Computational Sciences Oak Ridge National Laboratory Computing and Computational Sciences OFA Update by ORNL Presented by: Pavel Shamis (Pasha) OFA Workshop Mar 17, 2015 Acknowledgments Bernholdt David E. Hill Jason J. Leverman

More information

The Heterogeneous Programming Jungle. Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest

The Heterogeneous Programming Jungle. Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest The Heterogeneous Programming Jungle Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest June 19, 2012 Outline 1. Introduction 2. Heterogeneous System Zoo 3. Similarities 4. Programming

More information

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

John Levesque Nov 16, 2001

John Levesque Nov 16, 2001 1 We see that the GPU is the best device available for us today to be able to get to the performance we want and meet our users requirements for a very high performance node with very high memory bandwidth.

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Responsive Large Data Analysis and Visualization with the ParaView Ecosystem. Patrick O Leary, Kitware Inc

Responsive Large Data Analysis and Visualization with the ParaView Ecosystem. Patrick O Leary, Kitware Inc Responsive Large Data Analysis and Visualization with the ParaView Ecosystem Patrick O Leary, Kitware Inc Hybrid Computing Attribute Titan Summit - 2018 Compute Nodes 18,688 ~3,400 Processor (1) 16-core

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE Tutorial: Analyzing MPI Applications Intel Trace Analyzer and Collector Intel VTune Amplifier XE Contents Legal Information... 3 1. Overview... 4 1.1. Prerequisites... 5 1.1.1. Required Software... 5 1.1.2.

More information

A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing

A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing Wen-mei W. Hwu with Tom Jablin, Chris Rodrigues, Liwen Chang, Steven ShengZhou Wu, Abdul

More information

Overcoming Distributed Debugging Challenges in the MPI+OpenMP Programming Model

Overcoming Distributed Debugging Challenges in the MPI+OpenMP Programming Model Overcoming Distributed Debugging Challenges in the MPI+OpenMP Programming Model Lai Wei, Ignacio Laguna, Dong H. Ahn Matthew P. LeGendre, Gregory L. Lee This work was performed under the auspices of the

More information

Steve Scott, Tesla CTO SC 11 November 15, 2011

Steve Scott, Tesla CTO SC 11 November 15, 2011 Steve Scott, Tesla CTO SC 11 November 15, 2011 What goal do these products have in common? Performance / W Exaflop Expectations First Exaflop Computer K Computer ~10 MW CM5 ~200 KW Not constant size, cost

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Center Extreme Scale CS Research

Center Extreme Scale CS Research Center Extreme Scale CS Research Center for Compressible Multiphase Turbulence University of Florida Sanjay Ranka Herman Lam Outline 10 6 10 7 10 8 10 9 cores Parallelization and UQ of Rocfun and CMT-Nek

More information

Leveraging Flash in HPC Systems

Leveraging Flash in HPC Systems Leveraging Flash in HPC Systems IEEE MSST June 3, 2015 This work was performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344. Lawrence Livermore National Security,

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

NUIT Tech Talk Topics in Research Computing: XSEDE and Northwestern University Campus Champions

NUIT Tech Talk Topics in Research Computing: XSEDE and Northwestern University Campus Champions NUIT Tech Talk Topics in Research Computing: XSEDE and Northwestern University Campus Champions Pradeep Sivakumar pradeep-sivakumar@northwestern.edu Contents What is XSEDE? Introduction Who uses XSEDE?

More information

Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory

Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory Quinn Mitchell HPC UNIX/LINUX Storage Systems ORNL is managed by UT-Battelle for the US Department of Energy U.S. Department

More information

Experiences with Achieving Portability across Heterogeneous Architectures

Experiences with Achieving Portability across Heterogeneous Architectures Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore

More information

Addressing Heterogeneity in Manycore Applications

Addressing Heterogeneity in Manycore Applications Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction

More information