Koronis Performance Tuning 2. By Brent Swartz December 1, 2011
2 Application Tuning Methodology coll=linux&db=bks&cmd=toc&pth=/sgi_developer/lx_86_apptune
3 1. Tune the serial performance
If possible, incorporate already-tuned library routines, e.g.:
MKL: Intel's Math Kernel Library. The library includes BLAS, LAPACK, Sparse Matrix, and FFT routines.
VML: the Vector Math Library, part of the MKL package (libmkl_vml_itp.so). Described here: tm?iid=ipp_home+software_libraries&
4 2. Tune parallel performance
a. OpenMP: Stream OpenMP results/graphs, before and after dplace.
b. For MPI, can use perfboost/perfcatch. Show stream MPI results/graphs.
c. Scaling can be improved by determining which routines are NOT scaling well (sorting the routines by a metric) and then improving those routines. The metric can be as simple as the ratio of each routine's time at two different processor counts (e.g. run on 24 and then 72 procs, then sort by the ratio of the routine times).
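The ratio metric in item c can be sketched in a few lines of shell. This is only an illustration under assumptions: the function name rank_scaling, the two-column "routine seconds" file layout, and the sample timings are all hypothetical, not output from any of the profilers named in these slides.

```shell
#!/bin/bash
# rank_scaling FILE_SMALL FILE_LARGE
# Each file holds "routine_name seconds" lines from one profiled run
# (hypothetical layout). Prints "ratio routine", worst-scaling routines first.
rank_scaling() {
    awk 'NR == FNR { small[$1] = $2; next }
         ($1 in small) && $2 > 0 { printf "%.2f %s\n", small[$1] / $2, $1 }' "$1" "$2" |
        LC_ALL=C sort -n
}

# Hypothetical per-routine times (seconds) at 24 and at 72 procs.
tmp=$(mktemp -d)
printf '%s\n' 'solve 90.0' 'exchange 30.0' 'io 12.0' > "$tmp/t24.txt"
printf '%s\n' 'solve 31.0' 'exchange 24.0' 'io 12.5' > "$tmp/t72.txt"

# Ideal speedup from 24 to 72 procs is 3.0; ratios near or below 1.0
# flag routines that are not scaling.
rank_scaling "$tmp/t24.txt" "$tmp/t72.txt"
```

With the sample data, io (ratio 0.96) sorts first and solve (ratio 2.90) last, so io and exchange would be the first routines to examine.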
5 Profiling Tools
1. gprof (easiest)
2. perfcatch/perfboost (MPI). See PerfBoost in the Message Passing Toolkit (MPT) User's Guide, available on the Tech Pubs Library.
3. OPT
4. TAU
6 Application Examples A. Stream B. NAMD
7 Stream Benchmark
Download:
Fortran is compiled with: mpif90 -O3 -ipo -xSSE4.2 -fno-alias -i8 -openmp -extend_source -mcmodel=medium -i-dynamic -opt-streaming-stores always -nolib-inline
C is compiled with: icc -O3 -ipo -xSSE4.2 -fno-alias -openmp -extend_source -mcmodel=medium -i-dynamic -opt-streaming-stores always -nolib-inline
8 Stream Benchmark (cont.) We start with the original stream benchmark, running the C and Fortran versions with and without dplace, from 4 to 720 procs (set through OMP_NUM_THREADS). Conclusion: The Fortran and C performance is essentially the same (with dplace, anyway). Note how the dplace bandwidth is consistently better. Here is the PBS job that ran the benchmarks (note the form of the dplace command):
9
#!/bin/bash
#PBS -j oe
#PBS -q uv1000
#PBS -l select=120:ncpus=6:mpiprocs=6
#PBS -l place=pack:excl:group=boardpair
#PBS -l walltime=1:0:0
cpuset -d.
sleep 5
ulimit -s unlimited
# Report the processor set assigned to this job.
pset=`qstat -f $PBS_JOBID | grep pset | cut -f 2- -d"="`
echo "pset:" $pset
cd $PBS_O_WORKDIR
export KMP_AFFINITY=disabled
# The list of thread counts after "in" is missing from the transcript.
for NT in
do
export OMP_NUM_THREADS=$NT
# dplace takes a CPU range, so the last CPU index is NT - 1.
NT1=`echo "$NT - 1" | bc`
echo "NT1=$NT1"
dplace -x 2 -c 0-$NT1 stream_c.omp.exe > stream_c.omp.dplace.out.$NT
stream_c.omp.exe > stream_c.omp.out.$NT
dplace -x 2 -c 0-$NT1 stream_f.exe > stream_f.omp.dplace.out.$NT
stream_f.exe > stream_f.omp.out.$NT
done
10
11 Stream Benchmark (cont.) Now we look at the MPI Fortran Copy and Triad performance, with and without dplace. The number of procs is varied from 1 to 384 (1 rack). Conclusion: dplace does not affect MPI performance.
12
13 Stream Benchmark (cont.) Now we compare the OpenMP Triad performance with the MPI Triad performance. Conclusion: MPI is scaling much better. Why? Also, KMP_AFFINITY=disabled (by default). KMP_AFFINITY determines how the threads are mapped to the cores ( ntation/studio/composer/enus/2009/compiler_c/optaps/common/optaps_openmp_thread_affinity.htm).
14
15 Stream Benchmark (cont.) Now we do an empirical parameter study of various KMP_AFFINITY alternatives. Eleven alternatives are tried, with and without dplace, e.g. export KMP_AFFINITY=granularity=fine,compact,2,0
Conclusion: With dplace, performance is consistently worse for all KMP_AFFINITY settings except disabled and none. Without dplace, optimal performance is seen with many KMP_AFFINITY settings, e.g. KMP_AFFINITY=granularity=fine,compact,2,0 or KMP_AFFINITY=granularity=core,compact,1,0
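A parameter study of this kind can be driven by a small loop. The sketch below is a dry run under assumptions: it covers only the four KMP_AFFINITY settings named on these slides (the full list of eleven is not given), the function name affinity_study and the output file names are hypothetical, and it prints the command lines instead of running them so it can be tried on a machine without dplace.

```shell
#!/bin/bash
# affinity_study NTHREADS EXECUTABLE
# Prints (does not run) one command per KMP_AFFINITY setting,
# first without dplace and then with it. Output names are hypothetical.
affinity_study() {
    local nt=$1 exe=$2 s i=0
    local last=$(( nt - 1 ))   # dplace CPU range is 0..NT-1, as in the PBS job
    for s in disabled none granularity=fine,compact,2,0 granularity=core,compact,1,0; do
        i=$(( i + 1 ))
        echo "OMP_NUM_THREADS=$nt KMP_AFFINITY=$s $exe > out.aff$i.$nt"
        echo "OMP_NUM_THREADS=$nt KMP_AFFINITY=$s dplace -x 2 -c 0-$last $exe > out.aff$i.dplace.$nt"
    done
}

affinity_study 24 ./stream_c.omp.exe
```

On the target system the printed lines could be piped to bash to execute the study; printing first makes it easy to check the CPU ranges before spending batch time.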
16
17 Stream Benchmark (cont.) Looking at the OpenMP vs MPI performance, we realize that the cause of the OpenMP performance falling off is Gustafson's Law (weak vs strong scaling): the original benchmark spreads the 2M-element arrays across all threads, so at 100 threads each proc (thread) is working on only a 20,000-element array slice, and the time to do the work approaches the time required to create the threads. The stream benchmark is recoded to malloc the arrays instead of statically allocating them. Now the arrays are allocated dynamically based on the number of threads (procs), giving a constant 2M elements per proc, and we look at various good KMP_AFFINITY settings.
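The arithmetic behind this change is simple enough to check directly. The following sketch only reproduces the slice sizes; the function names slice_fixed and total_scaled are hypothetical and the recoded benchmark itself is not shown here.

```shell
#!/bin/bash
# Elements per thread when a fixed-size array is split across NT threads
# (original benchmark) versus total size when each thread keeps 2M elements
# (recoded benchmark).
N=2000000   # 2M elements, the original array size

slice_fixed()  { echo $(( N / $1 )); }   # fixed total size: slice shrinks with NT
total_scaled() { echo $(( N * $1 )); }   # constant 2M per thread: total grows with NT

for NT in 1 10 100; do
    echo "NT=$NT fixed-size slice=$(slice_fixed "$NT") scaled total=$(total_scaled "$NT")"
done
```

At 100 threads the fixed-size case leaves 20,000 elements per thread, which matches the figure quoted above, while the scaled case allocates 200,000,000 elements in total.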
18
19 Stream Benchmark (cont.) OMPLACE_AFFINITY_COMPAT and omplace alternatives were also tried, with no improvement. Conclusion: Now the OpenMP performance is on a par with the MPI performance: better for Nprocs between 1 and 192, transitioning to MPI performance for 193 to ~300, and matching MPI performance from 336 on, with KMP_AFFINITY=granularity=core,compact,1,0 or KMP_AFFINITY=granularity=fine,compact,2,0. For MPI apps, no dplace or KMP_AFFINITY is needed.
20 NAMD
NAMD 2.7 is currently installed on Koronis. NAMD 2.8 has been installed locally to test performance, and has shown a ~80% improvement in the apoa1 benchmark using the following invocation. numactl is used to obtain the processor list from the batch control system, and this proclist is fed into charm via the +setcpuaffinity flag. The stmv benchmark shows a performance degradation, so this is still experimental.
21 NAMD
#!/bin/bash -x
#PBS -j oe
#PBS -q uv1000
#PBS -l select=7:ncpus=6:mpiprocs=6
#PBS -l place=pack:excl:group=iruquadrant
#PBS -l walltime=1:0:0
cpuset -d. ; sleep 5
ulimit -s 100M ; ulimit -v unlimited
export KMP_STACKSIZE=100M
export NAMD_PATH=/home/koronis/swartzbr/namd/NAMD_2.8_Linux-x86_64-multicore
cd $PBS_O_WORKDIR
# Build "+pN +pemap c0,c1,..." from the CPUs bound to this job.
proc_list=`numactl --show | awk '/^physcpubind/ {printf "+p%d +pemap %d",(NF-1), $2; for(i=3;i<=NF;++i){printf ",%d",$i}}'`
$NAMD_PATH/namd2 +isomalloc_sync +setcpuaffinity $proc_list apoa1.namd > apoa1.out.mc.39
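The awk step that builds the charm processor list can be tried in isolation. In this sketch the helper name build_proc_list is hypothetical, and the input is a made-up "physcpubind:" line standing in for real numactl --show output on a batch node.

```shell
#!/bin/bash
# Turn a "physcpubind: ..." line (the form printed by numactl --show)
# into "+pN +pemap c0,c1,..." flags.
build_proc_list() {
    awk '/^physcpubind/ {
        printf "+p%d +pemap %d", NF - 1, $2        # CPU count, then first CPU id
        for (i = 3; i <= NF; ++i) printf ",%d", $i # remaining CPU ids
        printf "\n"
    }'
}

# Hypothetical sample line; on a real node use: numactl --show | build_proc_list
echo "physcpubind: 0 1 2 3 4 5" | build_proc_list
```

NF - 1 is the number of CPU ids because the first field is the "physcpubind:" label itself.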
22 NAMD
NAMD can utilize Koronis GPUs for certain types of runs. For more info see:
Versions of GAMESS, GROMACS and LAMMPS that can utilize GPUs for certain types of runs have just been released. These have not yet been installed on Koronis, so if you plan to use these codes on the GPUs, send an email to: