Computing and energy performance
1 IMS Team, INRIA AlGorille Project Team
Computing and energy performance optimization of a multi-algorithm PDE solver on CPU and GPU clusters
Stéphane Vialle, Sylvain Contassot-Vivier, Thomas Jost
13/01/2011
2 1 First experiments on GPU clusters
3 First experiments on GPU clusters: Experimental testbed
2 x 16 CPU+GPU node clusters. Node configurations shown: Xeon dual-core + GT8800, Xeon dual-core + GT285, Nehalem quad-core + GT285, Nehalem quad-core + GT480.
- a state-of-the-art 16-node CPU+GPU cluster
- an older 16-node CPU+GPU cluster
- 2 Gigabit Ethernet switches
- together, a heterogeneous 32-node cluster
- regular upgrade of the system
- some external energy monitoring devices: Raritan (Dominion PX)
4 First experiments on GPU clusters: Collection of experiments
3 benchmarks with different features:
1. European option pricer: embarrassingly parallel Monte Carlo computations; the parallel random number generator has been ported to the GPU
2. PDE solver: heavy computations; regular communications between nodes; some computations (must) remain on the CPU
3. 2D Jacobi relaxation: repetitive light computations; frequent communications between neighbor nodes
5 First experiments on GPU clusters: Collection of experiments
[Figure: execution time (s) and energy (Wh) versus number of nodes, on logarithmic scales, for the three benchmarks (Pricer parallel MC, PDE solver synchronous, Jacobi relaxation), comparing the monocore CPU cluster, the multicore CPU cluster and the manycore GPU cluster]
6 First experiments on GPU clusters: Computational & energy model design
Temporal gain (speedup) and energy gain of the GPU cluster vs the CPU cluster:
[Figure: speedup and energy gain of the GPU cluster vs the multicore CPU cluster versus number of nodes, on logarithmic scales; panels: Pricer parallel MC («OK»), PDE solver synchronous («Hum»), Jacobi relaxation («Beyond? Predictions?»)]
Up to 16 nodes this GPU cluster is more attractive than our CPU cluster, but its advantage decreases: why? And what happens beyond 16 nodes?
8 First experiments on GPU clusters: Computational & energy model design
Computations:
- CPU cluster: $T_{calc}^{CPU}$
- GPU cluster: if the algorithm is adapted to the GPU architecture, $T_{calc}^{GPU} \ll T_{calc}^{CPU}$; else: do not use GPUs!
Communications:
- CPU cluster: $T_{comm}^{CPU} = T_{comm}^{MPI}$
- GPU cluster: $T_{comm}^{GPU} = T_{transfer}^{GPUtoCPU} + T_{comm}^{MPI} + T_{transfer}^{CPUtoGPU}$, hence $T_{comm}^{GPU} \geq T_{comm}^{CPU}$
Total time: $T^{GPU}$ <?> $T^{CPU}$
For a given problem: when the number of nodes increases, $T_{comm}$ becomes dominant and the advantage of the GPU cluster decreases.
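The effect described on this slide can be illustrated with a minimal Python sketch. It assumes, as simplifications not stated on the slide, that computations and communications do not overlap (total time = computation time + communication time), that computation time scales as 1/N, and that communication and transfer times per node stay constant. All numeric values are hypothetical.

```python
def t_cpu(n, t_calc_cpu_1, t_comm_mpi):
    """Total time on the CPU cluster: computation (assumed ~1/N) + MPI communication."""
    return t_calc_cpu_1 / n + t_comm_mpi


def t_gpu(n, t_calc_gpu_1, t_comm_mpi, t_gpu_to_cpu, t_cpu_to_gpu):
    """Total time on the GPU cluster: computation (assumed ~1/N) plus
    GPU-to-CPU transfer, MPI communication and CPU-to-GPU transfer."""
    return t_calc_gpu_1 / n + t_gpu_to_cpu + t_comm_mpi + t_cpu_to_gpu


def speedup(n):
    """Speedup of the GPU cluster vs the CPU cluster, with hypothetical timings (s)."""
    return t_cpu(n, 1000.0, 2.0) / t_gpu(n, 50.0, 2.0, 1.0, 1.0)


for n in (1, 2, 4, 8, 16, 32):
    print(f"N = {n:2d}: T_CPU / T_GPU = {speedup(n):.2f}")
```

With these hypothetical values the GPU cluster stays faster, but the ratio shrinks as N grows because the fixed communication and transfer costs take a growing share of the GPU run time.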
9 2 A first performance model
10 A first performance model: First modelling approach
Observation of the first experimental performances: there exists a «scalable area», in which the performance curves of the CPU and GPU clusters have different slopes.
We model this «scalable area», assuming that some experimental measurements of the real application are possible (simple modelling).
[Figure: execution time (s), energy (Wh), and GPU cluster vs multicore CPU cluster gains versus number of nodes for the synchronous PDE solver, with the scalable area highlighted]
11 A first performance model: First modelling approach
We model the «scalable area». We observe:
$$T(N) = \frac{T(1)}{N^{\sigma_T}}, \qquad E(N) = E(1) \cdot N^{\sigma_E}, \qquad \text{with } \sigma_T^{CPU} > \sigma_T^{GPU}$$
We consider the electrical power dissipated by the nodes and the switch:
$$E(N) = P(1) \cdot T(N) \cdot N + P_{switch} \cdot \frac{N}{N_{max}} \cdot T(N)$$
with P: electrical power (Watts).
[Figure: T and E versus N (nodes) for the CPU and GPU clusters, with slopes $\sigma_T$ and $\sigma_E$]
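As a hedged illustration of the energy model above, the following Python sketch evaluates T(N) and E(N) for a few node counts. The function names and all parameter values (single-node time, scalability exponent, node and switch power, switch size) are hypothetical; only the two formulas come from the slide.

```python
def exec_time(n, t1, sigma_t):
    """Execution time in the scalable area: T(N) = T(1) / N**sigma_T (seconds)."""
    return t1 / n ** sigma_t


def energy_wh(n, t1, sigma_t, p_node, p_switch, n_max):
    """Energy E(N) = P(1).T(N).N + P_switch.(N/N_max).T(N), converted to Wh.

    p_node and p_switch are in Watts, times in seconds, so the raw result
    is in Joules; dividing by 3600 gives Watt-hours.
    """
    t = exec_time(n, t1, sigma_t)
    joules = p_node * t * n + p_switch * (n / n_max) * t
    return joules / 3600.0


# Hypothetical parameters, for illustration only.
T1, SIGMA_T, P_NODE, P_SWITCH, N_MAX = 3600.0, 0.8, 300.0, 100.0, 16

for n in (1, 2, 4, 8, 16):
    t = exec_time(n, T1, SIGMA_T)
    e = energy_wh(n, T1, SIGMA_T, P_NODE, P_SWITCH, N_MAX)
    print(f"N = {n:2d}: T = {t:7.1f} s, E = {e:6.1f} Wh")
```

Because both terms of E(N) are proportional to N.T(N), the ratio E(N)/E(1) in this sketch equals N raised to the power (1 - sigma_T).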
12 A first performance model: First modelling approach
We obtain:
$$\sigma_E = 1 - \sigma_T$$
$$SU_{G/C}(N) = SU_{G/C}(1) \cdot N^{\sigma_T^{GPU} - \sigma_T^{CPU}}, \qquad EG_{G/C}(N) = EG_{G/C}(1) \cdot N^{\sigma_T^{GPU} - \sigma_T^{CPU}}$$
We compute the 2 threshold numbers of nodes:
$$SU_{G/C}(N_T^{G/C}) = 1 \;\Rightarrow\; N_T^{G/C} = SU_{G/C}(1)^{1/(\sigma_T^{CPU} - \sigma_T^{GPU})}$$
$$EG_{G/C}(N_E^{G/C}) = 1 \;\Rightarrow\; N_E^{G/C} = EG_{G/C}(1)^{1/(\sigma_T^{CPU} - \sigma_T^{GPU})}$$
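The threshold computation can be sketched in Python as follows. The sketch assumes the power-law form of the gains in the scalable area, with the CPU exponent larger than the GPU exponent; the function names and the numeric inputs are hypothetical.

```python
def gain(n, gain_at_1, sigma_t_cpu, sigma_t_gpu):
    """GPU vs CPU gain (speedup or energy gain) on N nodes:
    G(N) = G(1) * N**(sigma_T_GPU - sigma_T_CPU)."""
    return gain_at_1 * n ** (sigma_t_gpu - sigma_t_cpu)


def threshold_nodes(gain_at_1, sigma_t_cpu, sigma_t_gpu):
    """Number of nodes at which the gain drops to 1:
    N = G(1) ** (1 / (sigma_T_CPU - sigma_T_GPU)).
    Requires sigma_t_cpu > sigma_t_gpu."""
    if sigma_t_cpu <= sigma_t_gpu:
        raise ValueError("the model assumes sigma_t_cpu > sigma_t_gpu")
    return gain_at_1 ** (1.0 / (sigma_t_cpu - sigma_t_gpu))


# Hypothetical single-node gains and scalability exponents.
SIGMA_CPU, SIGMA_GPU = 1.0, 0.5
n_t = threshold_nodes(16.0, SIGMA_CPU, SIGMA_GPU)  # speedup threshold
n_e = threshold_nodes(8.0, SIGMA_CPU, SIGMA_GPU)   # energy gain threshold
print(f"N_T = {n_t:.0f} nodes, N_E = {n_e:.0f} nodes")
```

In this hypothetical case the energy gain falls to 1 well before the speedup does, so the two thresholds delimit a range of node counts where the GPU cluster is still faster but no longer more energy efficient.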
13 A first performance model: First modelling approach
3 areas appear when increasing the number of nodes N:
- from 1 to $\min(N_T^{G/C}, N_E^{G/C})$: the GPU cluster is more efficient (in both T and E): choose the GPU cluster
- from $\min(N_T^{G/C}, N_E^{G/C})$ to $\max(N_T^{G/C}, N_E^{G/C})$: the GPU cluster is faster OR less energy consuming: a strategy and a heuristic are required to choose the GPU or the CPU cluster
- beyond $\max(N_T^{G/C}, N_E^{G/C})$: the CPU cluster is more efficient (in both T and E): choose the CPU cluster
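The three areas translate directly into a small decision function. This is only a sketch: the function name, the returned labels and the threshold values are hypothetical, and the intermediate area simply reports that a heuristic is needed, since the slide does not define one.

```python
def choose_cluster(n, n_t, n_e):
    """Return the cluster to use on n nodes, given the time threshold n_t
    and the energy threshold n_e of the GPU cluster vs the CPU cluster."""
    low, high = min(n_t, n_e), max(n_t, n_e)
    if n < low:
        return "GPU"        # GPU cluster better for both T and E
    if n > high:
        return "CPU"        # CPU cluster better for both T and E
    return "heuristic"      # only one of the two gains exceeds 1


# Hypothetical thresholds: N_T = 256 nodes, N_E = 64 nodes.
for n in (16, 100, 512):
    print(n, "->", choose_cluster(n, 256, 64))
```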
14 Improving performances with asynchronous algorithms Investigation with our PDE solver
15 Improving performances with asynchronous algorithms: Asynchronous parallel computing
Asynchronous algorithms provide implicit overlapping of communications and computations, and communications are significant on GPU clusters, so they should improve executions on GPU clusters.
And:
- some iterative algorithms (not all) can be turned into asynchronous algorithms,
- a strong mathematical theory supports this approach.
But:
- the convergence detection of the algorithm is more complex and requires more communications than with synchronous algorithms,
- some extra iterations are required to achieve the same accuracy.
16 Improving performances with asynchronous algorithms Parallel iterative PDE solver
17 Improving performances with asynchronous algorithms Inner linear solver
18 Improving performances with asynchronous algorithms: Asynchronous version, and a much more complex parallel implementation!
19 Improving performances with asynchronous algorithms Asynchronous version
20 Improving performances with asynchronous algorithms: Performances on a heterogeneous cluster
Execution time using both GPU clusters of Supelec (to minimize):
- 17 nodes Xeon dual core + GT[…], […] nodes Nehalem quad core + GT285
- 2 interconnected Gigabit switches
Remark: the two clusters are managed by one OAR server.
[Figure: execution time T_exec (s) versus number of fast nodes, for GPUs & synchronous and GPUs & asynchronous]
21 Improving performances with asynchronous algorithms: Performances on a heterogeneous cluster
Speedup vs 1 GPU (to maximize):
- the asynchronous version achieves a more regular speedup
- the asynchronous version achieves a better speedup on a high number of nodes
[Figure: speedup versus number of fast nodes, for GPU cluster & synchronous vs 1 GPU and GPU cluster & asynchronous vs 1 GPU; vertical axis labelled «Sync. Speedup vs seq.»]
22 Improving performances with asynchronous algorithms: Performances on a heterogeneous cluster
Energy consumption (to minimize):
- measurement errors become significant
- the sync. and async. energy consumption curves are (just) «different»
[Figure: energy consumption (W.h) versus number of fast nodes, for GPU cluster & synchronous and GPU cluster & asynchronous]
23 Improving performances with asynchronous algorithms: Performances on a heterogeneous cluster
Energy overhead factor vs 1 GPU (to minimize):
- the overhead curves are (just) «different»
- there is no longer a globally attractive solution!
[Figure: energy overhead factor versus number of fast nodes, for GPU cluster & synchronous vs 1 GPU and GPU cluster & asynchronous vs 1 GPU]
24 4 Need for a new performance model and an auto adaptive solution
25 Need for a new performance model and an auto adaptive solution: Relative async. vs sync. performances
The relative async. vs sync. speedup and energy gain exhibit some similarities:
- they can be used to choose the version to run
- they need a fine model (region frontiers are complex)
- they need a heuristic when only one gain is greater than 1
[Figure: relative speedup and relative energy gain, with the regions where the async. version is better]
26 Need for a new performance model and an auto adaptive solution: Relative async. vs sync. performances
Energy Delay Product (EDP) (to minimize):
- used to track a global optimum, considering both T and E
- parallel runs on many nodes seem better
- no large differences between the sync. and async. versions
[Figure: EDP for GPU cluster & synchronous and GPU cluster & asynchronous]
27 Need for a new performance model and an auto adaptive solution: Relative async. vs sync. performances
Async. vs sync. relative Energy Delay Product ratio:
- we compute: $EDP_{sync} / EDP_{async}$
- it can be used to make a choice inside ambiguous regions: compute the EDP and choose the sync. or async. version in this region (where relative SU > 1 and relative EG < 1)
[Figure: regions «Choose async. version» and «Choose sync. version»]
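A minimal Python sketch of this selection rule follows. It assumes the usual definition EDP = E x T, takes the relative speedup as T_sync / T_async and the relative energy gain as E_sync / E_async, and uses hypothetical measurements; the function name is illustrative.

```python
def choose_version(t_sync, e_sync, t_async, e_async):
    """Choose between the synchronous and asynchronous versions.

    Unambiguous regions are decided by the relative speedup and energy
    gain; ambiguous regions are decided by the ratio EDP_sync / EDP_async,
    with EDP = energy * time (assumed definition).
    """
    rel_speedup = t_sync / t_async
    rel_energy_gain = e_sync / e_async
    if rel_speedup > 1 and rel_energy_gain > 1:
        return "async"
    if rel_speedup < 1 and rel_energy_gain < 1:
        return "sync"
    edp_ratio = (e_sync * t_sync) / (e_async * t_async)
    return "async" if edp_ratio > 1 else "sync"


# Hypothetical measurements (time in s, energy in Wh): the async. version
# is faster but consumes more energy, so the EDP ratio decides.
print(choose_version(t_sync=100.0, e_sync=50.0, t_async=80.0, e_async=60.0))
print(choose_version(t_sync=100.0, e_sync=50.0, t_async=90.0, e_async=70.0))
```

With EDP defined as a product, the EDP ratio is the product of the relative speedup and the relative energy gain, so the two unambiguous branches always agree with it; they are kept explicit only to mirror the regions of the slide.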
28 Need for a new performance model and an auto adaptive solution: Automatic choice criteria
Automatic selection of the «best» version to run: synchronous or asynchronous algorithm, on the CPU cluster or the GPU cluster.
Criteria:
- relative speedup: tracking HPC performance
- relative energy gain: tracking low energy consumption
- relative energy delay product: tracking a compromise
But we need a model to automate this choice:
- a fine model: criteria variations are small and region frontiers are complex
- a model not requiring long and large experiments
29 Need for a new performance model and an auto adaptive solution: Fine model required
First model limitations:
- requires/assumes a «scalable area»
- approximate model (not fine)
- requires 4 executions of the entire application: on 1 and $N_0$ nodes, running the 2 versions to compare, measuring both T and E
- adapted to optimizing the execution of a long-life application scaling on a parallel machine
[Figure: T versus N (nodes) for the CPU and GPU clusters, with slopes $\sigma_T^{CPU}$ and $\sigma_T^{GPU}$]
To achieve an automatic selection on a short-life application, we need:
- a model requiring only small elementary benchmarks to fix the model parameter values on the hardware used,
- a fine model not requiring the application to exhibit perfect scalability on the architecture.
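To make the cost of the first model concrete, here is a hedged Python sketch of how its scalability exponent could be fitted from two runs of one version, on 1 node and on N0 nodes, assuming the power law T(N) = T(1) / N**sigma_T holds between these two points. The function name and the timings are hypothetical.

```python
import math


def fit_sigma_t(t_1, t_n0, n0):
    """Fit sigma_T from two measurements, assuming T(N) = T(1) / N**sigma_T:
    sigma_T = ln(T(1) / T(N0)) / ln(N0)."""
    if n0 <= 1:
        raise ValueError("n0 must be greater than 1")
    return math.log(t_1 / t_n0) / math.log(n0)


# Hypothetical timings: 100 s on 1 node, 12.5 s on 16 nodes.
sigma_t = fit_sigma_t(100.0, 12.5, 16)
print(f"sigma_T = {sigma_t:.2f}")
```

Repeating this for the second version gives the four executions listed above, each of them a full run of the application, which is the cost a model based on small elementary benchmarks aims to avoid.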
30 Need for a new performance model and an auto adaptive solution: Fine model required
A first version of this fine model exists. It takes into account:
- the different power dissipation of the different «identical» nodes of the cluster,
- when starting computations on the GPU, the power dissipation increases in 2 steps: when the GPU starts to compute, then when the fan of the GPU starts and/or when the GPU increases its frequency,
- when stopping computations on the GPU, the power dissipation decreases in several steps, but not immediately!
31 Need for a new performance model and an auto adaptive solution: Fine model required
[Figure: power (Watts) versus time (s)]
32 Need for a new performance model and an auto adaptive solution: Fine model required
First evaluation, on our PDE solver execution and on our heterogeneous GPU cluster:
- error (model vs observation): 6%
- after some bias corrections: 1%
A stronger evaluation is planned:
- on different hardware
- on different applications
Then a heuristic for auto adaptation/auto selection of the right algorithm will be implemented.
To be continued
33 5 Conclusion and perspectives
34 Long term objectives
[Diagram elements: elementary hardware benchmarks; performance model (energy and computations); heuristic; a.out containing Kernel 1 v1, Kernel 1 v2, Kernel 1 v3, Kernel 2 v1, Kernel 2 v2, Parallel algorithm 1 and Parallel algorithm 2; user and scheduler; O [speed, energy, edp, ]; auto adaptation of a.out; three outcomes: end after fast execution, end after EDP compromise execution, end after low energy consumption execution]
35 IMS Team, INRIA AlGorille Project Team
Computing and energy performance optimization of a multi-algorithm PDE solver on CPU and GPU clusters
Questions?
Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments Swen Böhm 1,2, Christian Engelmann 2, and Stephen L. Scott 2 1 Department of Computer
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation Parallel Compilation Two approaches to compilation Parallelize a program manually Sequential code converted to parallel code Develop
More informationEvaluation and Improvements of Programming Models for the Intel SCC Many-core Processor
Evaluation and Improvements of Programming Models for the Intel SCC Many-core Processor Carsten Clauss, Stefan Lankes, Pablo Reble, Thomas Bemmerl International Workshop on New Algorithms and Programming
More informationStatic and dynamic processing of discrete data structures
Static and dynamic processing of discrete data structures C2S@Exa François Pellegrini Contents 1. Context 2. Works 3. Results to date 4. Perspectives François Pellegrini Pole 3: Discrete data structures
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally
More informationHigh Performance Computing. Introduction to Parallel Computing
High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials
More informationX10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management
X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large
More informationGPI-2: a PGAS API for asynchronous and scalable parallel applications
GPI-2: a PGAS API for asynchronous and scalable parallel applications Rui Machado CC-HPC, Fraunhofer ITWM Barcelona, 13 Jan. 2014 1 Fraunhofer ITWM CC-HPC Fraunhofer Institute for Industrial Mathematics
More informationModelling Multi-GPU Systems 1
Modelling Multi-GPU Systems 1 Daniele G. SPAMPINATO a, Anne C. ELSTER a and Thorvald NATVIG a a Norwegian University of Science and Technology (NTNU), Trondheim, Norway Abstract. Due to the power and frequency
More informationAsynchronous Parallel Stochastic Gradient Descent. A Numeric Core for Scalable Distributed Machine Learning Algorithms
Asynchronous Parallel Stochastic Gradient Descent A Numeric Core for Scalable Distributed Machine Learning Algorithms J. Keuper and F.-J. Pfreundt Competence Center High Performance Computing Fraunhofer
More informationHeterogeneous platforms
Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk
More informationEfficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics
1 Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics Mingzhe Li Sreeram Potluri Khaled Hamidouche Jithin Jose Dhabaleswar K. Panda Network-Based Computing Laboratory Department
More informationScalable and Fault Tolerant Failure Detection and Consensus
EuroMPI'15, Bordeaux, France, September 21-23, 2015 Scalable and Fault Tolerant Failure Detection and Consensus Amogh Katti, Giuseppe Di Fatta, University of Reading, UK Thomas Naughton, Christian Engelmann
More informationAlgorithmic scheme for hybrid computing with CPU, Xeon-Phi/MIC and GPU devices on a single machine
Algorithmic scheme for hybrid computing with CPU, Xeon-Phi/MIC and GPU devices on a single machine Sylvain CONTASSOT-VIVIER a and Stephane VIALLE b a Loria - UMR 7503, Université de Lorraine, Nancy, France
More informationTR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut
TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationReduce Costs & Increase Oracle Database OLTP Workload Service Levels:
Reduce Costs & Increase Oracle Database OLTP Workload Service Levels: PowerEdge 2950 Consolidation to PowerEdge 11th Generation A Dell Technical White Paper Dell Database Solutions Engineering Balamurugan
More informationWhat is Parallel Computing?
What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing
More informationTechniques to improve the scalability of Checkpoint-Restart
Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart
More informationAddressing Heterogeneity in Manycore Applications
Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction
More informationHPC and IT Issues Session Agenda. Deployment of Simulation (Trends and Issues Impacting IT) Mapping HPC to Performance (Scaling, Technology Advances)
HPC and IT Issues Session Agenda Deployment of Simulation (Trends and Issues Impacting IT) Discussion Mapping HPC to Performance (Scaling, Technology Advances) Discussion Optimizing IT for Remote Access
More informationIMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM
IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information
More informationHarp-DAAL for High Performance Big Data Computing
Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big
More informationJava Heap Resizing From Hacked-up Heuristics to Mathematical Models. Jeremy Singer and David R. White
Java Heap Resizing From Hacked-up Heuristics to Mathematical Models Jeremy Singer and David R. White Outline Background Microeconomic Theory Heap Sizing as a Control Problem Summary Outline Background
More information