PORTABLE AND SCALABLE SOLUTIONS FOR CFD ON MODERN SUPERCOMPUTERS
1 PORTABLE AND SCALABLE SOLUTIONS FOR CFD ON MODERN SUPERCOMPUTERS Ricard Borrell Pol Heat and Mass Transfer Technological Center cttc.upc.edu Termo Fluids S.L termofluids.co Barcelona Supercomputing Center BSC.es
2 Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
3 Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
4 Moore's Law Moore's law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years. (Wikipedia)
5 Moore's Law Equivalent in HPC: the number of FLOP/s doubles approximately every two years (for the LINPACK benchmark). top500.org
6 LINPACK vs HPCG Top supercomputers run the LINPACK benchmark very well (64%, 65%, 74%, 85%, 93% of peak) but are very inefficient at solving PDEs: on HPCG they reach only 0.3%!!, 1.1%, 1.2%, 1.6%, 4.4% of peak. Top500.org & hpcg-benchmark.org
7 Memory wall The arithmetic intensity needed to achieve the peak performance of computing devices keeps growing (Karl Rupp; J. Dongarra, ATPESC 2015). The dominant kernel in CFD is the SpMV or equivalent stencil operations. Flop/byte ratios (double precision): BLAS1 ~ 1/8, SpMV ~ 1.4/8*, BLAS2 ~ 2/8, BLAS3 ~ (2/3 n)/8. The performance of a CFD code sits between BLAS1 and BLAS2 (roughly 3-6% of peak). Memory bandwidth is the limiting factor: performance relies on counting bytes, not flops. * Laplacian discretization on a tetrahedral mesh
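The memory-wall argument above can be made concrete with a minimal roofline-style sketch: attainable performance is capped by arithmetic intensity times bandwidth. The flop/byte ratios are those from the slide; the device numbers (500 GFLOP/s peak, 100 GB/s bandwidth) are placeholders, not measurements from the talk.

```python
def attainable_gflops(flops_per_byte, peak_gflops, bandwidth_gb_s):
    """Roofline model: a kernel cannot exceed min(peak, AI * bandwidth)."""
    return min(peak_gflops, flops_per_byte * bandwidth_gb_s)

# Flop/byte ratios from the slide (double precision)
ai = {"BLAS1": 1 / 8, "SpMV": 1.4 / 8, "BLAS2": 2 / 8}

# Hypothetical device: 500 GFLOP/s peak, 100 GB/s memory bandwidth
for kernel, ratio in ai.items():
    gf = attainable_gflops(ratio, 500.0, 100.0)
    print(kernel, round(gf, 1))  # every kernel lands far below the 500 GFLOP/s peak
```

At these intensities the bound is always the bandwidth term, which is exactly why the talk counts bytes instead of flops.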
8 Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
9 TermoFluids CODE [1/6] General-purpose unstructured CFD code. Based on a finite-volume symmetry-preserving discretization on unstructured meshes. Includes several LES and regularization models for incompressible turbulent flows. Expansion to multi-physics simulations: multi-phase flows, particle transport, reactive flows, fluid-structure interaction, multi-fluid flows, dynamic meshes...
10 TermoFluids CODE [2/6] HPC at TermoFluids: C++, object-oriented. Parallelization based on the distributed-memory model (pure MPI); recently developed hybrid model with GPU co-processors (MPI+CUDA). Performance barriers: synchronism (inter-CPU communications: point-to-point, all-reduce); flops (low arithmetic intensity, memory wall); random memory accesses. Systems used: Curie (TGCC), MareNostrum (BSC), JFF (CTTC), Lomonosov (MSU), Mira (ALCF), MinoTauro (BSC).
11 TermoFluids CODE [3/6] Largest scalability tests*: performed on the Mira supercomputer (BG/Q) of the Argonne Leadership Computing Facility (ALCF). Scalability tests up to 131K CPU-cores, reaching 76% and 67% efficiency (for the last points, only 15K and 7K cells/core respectively). All phases of the simulation analyzed at the largest scale: pre-processing, check-pointing (IO)... Test case: differentially heated cavity. *R. Borrell, J. Chiva, O. Lehmkuhl, I. Rodriguez and A. Oliva. Evolving TermoFluids CFD code towards peta-scale simulations. International Journal of Computational Fluid Dynamics. In press.
12 TermoFluids CODE [4/6] Largest production simulations performed in the context of PRACE Tier-0 projects. 6th PRACE call: DRAGON - Understanding the DRAG crisis: ON the flow past a circular cylinder from critical to transcritical Reynolds numbers; 23M hours (largest simulation 4096 CPU-cores). 10th PRACE call: Direct Numerical Simulation of Gravity-Driven Bubbly Flows; 22M hours (largest simulation 3072 CPU-cores).
13 TermoFluids CODE [5/6]
14 TermoFluids CODE [6/6] Industrial applications: the same software libraries are used for leading-edge computational projects and for industrial applications. ENAIR: 3D simulation of wind turbine blades. CLEAN SKY: EFFAN - optimization of the electrical ram-air fan used in all-electric aircraft. HP: simulation of 3D printers.
15 Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
16 Generic algebraic approach [1/5] Applied to the LES simulation of turbulent flows of incompressible Newtonian fluids. Finite-volume second-order symmetry-preserving discretization. Temporal discretization based on a second-order explicit Adams-Bashforth scheme. Pressure-velocity coupling: fractional-step projection method.
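The temporal scheme above can be written out explicitly. A standard second-order Adams-Bashforth step combined with a fractional-step projection reads (notation assumed; R collects the convective, diffusive and subgrid terms):

```latex
\mathbf{u}^{p} = \mathbf{u}^{n}
  + \Delta t\left(\tfrac{3}{2}\,R(\mathbf{u}^{n})
  - \tfrac{1}{2}\,R(\mathbf{u}^{n-1})\right),
\qquad
\nabla^{2} p^{n+1} = \frac{\rho}{\Delta t}\,\nabla\cdot\mathbf{u}^{p},
\qquad
\mathbf{u}^{n+1} = \mathbf{u}^{p} - \frac{\Delta t}{\rho}\,\nabla p^{n+1}.
```

The Poisson equation for the pressure is the step solved by the PCG iterations discussed later.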
17 Generic algebraic approach [2/5] We are in a disruptive moment where different HPC solutions compete: portability across many architectures is a must. We are developing an algebraic Generic Integration Platform (GIP) to perform time integrations: TIME INTEGRATION based on STENCIL OPERATIONS -> TIME INTEGRATION based on ALGEBRAIC KERNELS. Code portability, code modularity. G. Oyarzun, R. Borrell, A. Gorobets and A. Oliva. Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers. SIAM Journal on Scientific Computing. Under review.
18 Generic algebraic approach [3/5] Algebraic kernels: vector-vector operations (AXPY: y = a*x + y; DOT product); sparse matrix-vector product (SpMV); non-linear operators (convective term): the convective term is decomposed into two SpMVs. A similar process modifies the diffusive term according to the turbulent viscosity.
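The three algebraic kernels named above are simple enough to sketch in full. This is an illustrative plain-Python version (the production code uses C++/CUDA); the SpMV uses the common CSR layout, though the talk's GPU implementation uses sliced ELLPACK.

```python
def axpy(a, x, y):
    """AXPY: y <- a*x + y (returned as a new list)."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """DOT product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def spmv_csr(row_ptr, col_idx, vals, x):
    """SpMV y = A*x with A stored in CSR (compressed sparse row) format."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y.append(s)
    return y
```

Example: the matrix [[2,0],[1,3]] in CSR is row_ptr=[0,1,3], col_idx=[0,0,1], vals=[2,1,3], and spmv_csr with x=[1,1] gives [2.0, 4.0].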
19 Generic algebraic approach [4/5] Generic approach: the CFD time integration depends on the specific implementation of 4 abstract classes. The algebraic operators are imported from external codes: TermoFluids, OpenFOAM, Code_Saturne, etc. The GIP can be used to port the time integration of other simulation codes to new architectures. Diagram of the implementation strategy.
20 Generic algebraic approach [5/5] 98% of the time integration is spent in only three algebraic kernels; this situation favors the portability of the code across different computing platforms. Kernel counts: outside the Poisson solver - SpMV 30, AXPY 10, DOT 2; per PCG iteration - SpMV 2, AXPY 3, plus DOT products. Runtime share (LES of the flow around the ASMO car, 5.5M-cell mesh, 32 GPUs): SpMV dominates at 80.77%; AXPY and DOT account for 8.79% and 9.12%; other operations make up the remainder.
21 Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
22 Accelerators in HPC [1/2] Accelerators are becoming increasingly popular in leading-edge supercomputers. Potential to significantly reduce space, power consumption, and cooling demands. Context: constrained power-consumption target (~25 MW for the entire system) - the power wall. In the top500.org list, a growing share of systems are based on hybrid nodes: of the first 15 positions of the Top500 list, 8 (53%) are based on hybrid nodes, and 100% of the first 15 positions in the Green500 list are hybrid nodes with accelerators (NVIDIA).
23 Accelerators in HPC [2/2] Design goals for CPUs: make a single thread very fast; reduce latency through large caches; predict, speculate. Design goals for GPUs: throughput matters and single threads do not; more transistors dedicated to computation; hide memory latency through concurrency; remove modules to make simple instructions fast (out-of-order control logic, branch-predictor logic, memory prefetch unit); share the cost of the instruction stream across many ALUs (SIMD model); multiple contexts per streaming multiprocessor (SM) hide latency. Source: Tim Warburton, ATPESC 2014
24 MinoTauro Supercomputer MinoTauro (BSC) was used in the present work. Nodes: 2 Intel E5649 (6-core) processors at 2.53 GHz (Westmere), 12 GB RAM per CPU; 2 NVIDIA M2090 (Tesla) GPU cards, 6 GB RAM per GPU. Network: InfiniBand QDR (40 Gbit/s each) in a non-blocking network.
25 Implementation Algebraic kernels: vector-vector operations - CUBLAS 5.0. Sparse matrix-vector product - sliced ELLPACK format: (1) group rows by number of entries, (2) use the ELLPACK format on each subgroup. GFLOPS and average speedups were compared across devices and SpMV formats for several mesh sizes (thousands of cells): CPU CSR (MKL), CPU ELLPACK, GPU CSR (cuSPARSE), GPU HYB (cuSPARSE), and GPU sliced ELLPACK.
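The sliced ELLPACK idea can be sketched as follows. This is a hedged illustration, not the talk's implementation: here each slice is a group of consecutive rows padded to the widest row in the slice (the talk additionally groups rows by entry count, which reduces padding further).

```python
def csr_to_sliced_ell(row_ptr, col_idx, vals, slice_h):
    """Convert CSR to a sliced-ELLPACK-like layout with slices of slice_h rows."""
    slices = []
    n = len(row_ptr) - 1
    for s in range(0, n, slice_h):
        rows = range(s, min(s + slice_h, n))
        width = max(row_ptr[i + 1] - row_ptr[i] for i in rows)  # per-slice width
        cols, v = [], []
        for i in rows:
            c = list(col_idx[row_ptr[i]:row_ptr[i + 1]])
            a = list(vals[row_ptr[i]:row_ptr[i + 1]])
            pad = width - len(c)
            cols.append(c + [0] * pad)    # pad with a dummy column index
            v.append(a + [0.0] * pad)     # zero values make the padding harmless
        slices.append((s, width, cols, v))
    return slices

def spmv_sliced_ell(slices, x):
    """SpMV over the sliced layout: uniform inner-loop length within each slice."""
    y = [0.0] * len(x)
    for start, width, cols, v in slices:
        for r in range(len(cols)):
            y[start + r] = sum(v[r][k] * x[cols[r][k]] for k in range(width))
    return y
```

The payoff on a GPU is that all threads working on one slice execute the same loop length, so padding waste is confined to a slice rather than paid matrix-wide as in plain ELLPACK.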
26 SpMV kernel [1/4] Sparsity patterns: no ordering vs. Cuthill-McKee ordering. Theoretically achievable performance (perfect locality and alignment assumed). Arithmetic intensity (Ax=b for a uniform tetrahedral mesh): A bytes: (8*5*N)+(4*5*N) = 60N; b bytes: 8N; x bytes: 8N (max. cache reuse); SpMV bytes: 76N; SpMV flops: 9N; flop/byte ratio: 9/76 = 0.12.
27 SpMV kernel [2/4] Theoretically achievable performance (perfect locality and alignment assumed). Performance on an Intel Xeon E5640 (6 cores, turbo freq. 2.93 GHz, bandwidth 25.6 GB/s): peak performance 24 flops/cycle x 2.93 Gcycles/s = 70.32 Gflop/s (flops: (2 FMA + 2 SIMD) x 6 cores). Time for computations: 10N flop / 70.32 Gflop/s = 0.14 N ns. Time for data movement: 76N bytes / 32 GB/s = 2.33 N ns. Ratio time data movement / time computations ~17!! Achievable performance: 9/76 x 32 = 3.8 Gflop/s (~5% of peak). Performance on an NVIDIA M2090 (Tesla): peak performance 665.6 Gflop/s; bandwidth 141.6 GB/s (ECC on). Achievable performance: 9/76 x 141.6 = 16.8 Gflop/s (~2.5% of peak).
28 SpMV kernel [2/4] (cont.) 16.8 / 3.8 = 4.4 = 141.6 / 32: the performance ratio equals the bandwidth ratio, as expected for a memory-bound kernel.
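The byte and flop counts in the estimate above can be reproduced directly. The sketch below encodes the slide's model (5 entries per row, 8-byte values, 4-byte indices, 9 flops per row) and recovers the 3.8 and 16.8 Gflop/s figures from the 32 GB/s and 141.6 GB/s bandwidths:

```python
def spmv_traffic_bytes(n, nnz_per_row=5):
    """Bytes moved for y = A*x: matrix values + indices, plus b and x vectors."""
    a_bytes = (8 + 4) * nnz_per_row * n   # 8-byte values, 4-byte column indices
    return a_bytes + 8 * n + 8 * n        # + result vector b + input vector x

def spmv_flops(n, nnz_per_row=5):
    """One multiply per entry, nnz-1 adds per row."""
    return (2 * nnz_per_row - 1) * n

ai = spmv_flops(1) / spmv_traffic_bytes(1)   # 9/76 ~ 0.118 flop/byte
cpu_gflops = ai * 32.0      # ~3.8 Gflop/s on the CPU (32 GB/s effective)
gpu_gflops = ai * 141.6     # ~16.8 Gflop/s on the M2090 (ECC on)
print(round(cpu_gflops, 1), round(gpu_gflops, 1), round(gpu_gflops / cpu_gflops, 1))
```

The final ratio is 141.6/32 ≈ 4.4 regardless of the arithmetic-intensity value, which is the slide's point: only bandwidth matters.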
29 SpMV kernel [3/4] Net performance on a single 6-core CPU (left) and on a single GPU (right).
30 SpMV kernel [4/4] Speedup GPU vs CPU. For typical per-device workloads, the bandwidth of the GPU is better exploited! Remember: CPU RAM 12 GB (6 cores), GPU RAM 6 GB.
31 Multi-GPU SpMV kernel [1/4] MPI + CUDA implementation. Parallelization based on a domain decomposition: one MPI process per subdomain and one GPU per MPI process. Local data partition: separate inner parts (which do not require data from other subdomains) from interface parts (which require external elements). Local data partition + a two-stream model -> overlapping computations on the GPU with communications.
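The inner/interface split described above amounts to classifying each local row by whether it references a halo column. A hedged sketch (plain Python standing in for the CUDA/MPI code; local columns are 0..n_local-1, halo columns are >= n_local):

```python
def split_rows(row_ptr, col_idx, n_local):
    """Classify rows: 'interface' rows touch halo columns, 'inner' rows do not.

    The two-stream schedule this enables: launch the inner-row SpMV on one
    CUDA stream while MPI exchanges the halo entries, then run the
    interface-row SpMV on a second stream once the halo has arrived.
    """
    inner, interface = [], []
    for i in range(len(row_ptr) - 1):
        touches_halo = any(col_idx[k] >= n_local
                           for k in range(row_ptr[i], row_ptr[i + 1]))
        (interface if touches_halo else inner).append(i)
    return inner, interface
```

Example: with row_ptr=[0,2,4], col_idx=[0,1,1,2] and n_local=2, row 0 only touches local columns (inner) while row 1 references halo column 2 (interface).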
32 Multi-GPU SpMV kernel [2/4] (left): weak speedup test up to 128 GPUs; (right): overlapping effect on the executions with 128 GPUs.
33 Multi-GPU SpMV kernel [3/4] Strong scalability: (left) speedup, (right) parallel efficiency. Note: one CPU "device" is 6 cores.
34 Multi-GPU SpMV kernel [3/4] (cont.) GPU occupancy drops as the load per GPU shrinks: 400K cells -> 80%, 200K -> 55%, 100K -> 35%. Strong scalability: (left) speedup, (right) parallel efficiency. Note: one CPU "device" is 6 cores.
35 Multi-GPU SpMV kernel [4/4] (left): normalized performance of the SpMV computations; (right): estimated speedup for a hypothetical constant performance (canceling cache and occupancy effects).
36 Multi-GPU SpMV kernel [4/4] (cont.) ...but the GPU is 4 times faster! (left): net performance of the computing part for the strong speedup test; (right): estimated speedup if performance remained constant (canceling cache and occupancy effects).
37 LES test: flow around the ASMO car [1/4] Flow around the ASMO car, Re = 7e5. 5.5-million-cell unstructured mesh with a prismatic boundary layer. Sub-grid scale model: wall-adapting local-eddy viscosity (WALE). Poisson solver: CG with Jacobi diagonal scaling. Flow and turbulent structures around simplified car models. D.E. Aljure, O. Lehmkuhl, I. Rodríguez, A. Oliva. Computers & Fluids 96 (2014).
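The Poisson solver named above, CG with Jacobi diagonal scaling, is built entirely from the three portable kernels (SpMV, AXPY, DOT). A minimal plain-Python sketch, assuming a symmetric positive-definite matrix in CSR format (illustrative only, not the production solver):

```python
def pcg_jacobi(row_ptr, col, val, b, tol=1e-10, maxit=200):
    """Conjugate gradient with Jacobi (diagonal) preconditioning, CSR input."""
    n = len(b)
    diag = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col[k] == i:
                diag[i] = val[k]

    def spmv(x):  # the dominant kernel of each iteration
        return [sum(val[k] * x[col[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
                for i in range(n)]

    x = [0.0] * n
    r = list(b)                                   # r = b - A*0
    z = [ri / di for ri, di in zip(r, diag)]      # Jacobi scaling z = D^-1 r
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(maxit):
        Ap = spmv(p)
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]         # AXPY
        r = [ri - alpha * api for ri, api in zip(r, Ap)]      # AXPY
        if sum(ri * ri for ri in r) ** 0.5 < tol:             # DOT
            break
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

Because every line is an SpMV, AXPY, DOT, or diagonal scaling, porting those kernels is enough to port the whole solver, which is the portability argument of the talk.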
38 LES test: flow around the ASMO car [2/4] (left): relative weight of the main operations for different numbers of CPUs and GPUs; (right): average relative weight over all tests.
39 LES test: flow around the ASMO car [3/4] Note: one CPU "device" is 6 cores. The performance of the overall CFD code on any system can be estimated by testing only the three algebraic kernels.
40 LES test: flow around the ASMO car [4/4] Speedup multi-GPU vs multi-CPU. Note: one CPU "device" is 6 cores.
41 Tests on Mont-Blanc ARM [1/2] Mont-Blanc: a European project focused on the development of a new type of computer architecture, capable of setting future global HPC standards, built from energy-efficient solutions used in embedded and mobile devices. Termo Fluids S.L. is part of the Industrial User Group (IUG). We have run parallel LES simulations on Mont-Blanc nodes using the GIP platform. Specifics: load distribution of the SpMV kernel, 100K rows. An OpenCL + OpenMP + MPI model is required to engage all components of the nodes. Shared memory between CPU and GPU requires an accurate load distribution. CPU: Cortex-A dual core; GPU: Mali T-604 (OpenCL 1.1 capable); Network: 10 Gbit/s Ethernet.
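When CPU and GPU share the same memory and run at comparable speed, the load distribution reduces to splitting the rows in proportion to each device's measured throughput. A hedged sketch (the helper name and the proportional policy are illustrative assumptions, not the talk's implementation):

```python
def split_rows_by_throughput(n_rows, cpu_gflops, gpu_gflops):
    """Give each device a share of rows proportional to its measured throughput."""
    cpu_rows = round(n_rows * cpu_gflops / (cpu_gflops + gpu_gflops))
    return cpu_rows, n_rows - cpu_rows
```

With similar per-device performance, as reported on the Mont-Blanc nodes, the split is close to 50/50, which is precisely why hybridizing CPU and GPU is worthwhile there (unlike on MinoTauro, where the GPU dominates).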
42 Tests on Mont-Blanc ARM [2/2] The similar performance of CPU and GPU makes hybridization meaningful. Languages: CPU - OpenMP, GPU - OpenCL. Synchronization points (clFinish()) are required to maintain main-memory coherence.
43 Tests on Mont-Blanc ARM [2/2] (cont.) (left): weak speedup; (right): strong speedup.
44 Outline Some notions of HPC TermoFluids CFD code Portable implementation model Application for hybrid clusters Concluding remarks
45 CONCLUDING REMARKS Exaflops come with disruptive changes in HPC technology. We developed a portable version of our CFD code based on an algebraic operational approach. ~98% of the computing time is spent on three kernels: SpMV, AXPY, DOT. The three kernels are clearly memory-bound: performance depends exclusively on the bandwidth achieved, not on flops. Bandwidth is better exploited by the throughput-oriented (latency-hiding) approach of GPUs. The overall time-step performance on any system can be estimated by testing the three basic kernels. The speedup of the multi-GPU vs the multi-CPU implementation on the LES simulation of the flow around the ASMO car ranges from 4x to 8x on the MinoTauro supercomputer.
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationAccelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University
Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationA Massively Parallel Two-Phase Solver for Incompressible Fluids on Multi-GPU Clusters
A Massively Parallel Two-Phase Solver for Incompressible Fluids on Multi-GPU Clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn GPU
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationExploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization
Exploiting Task-Parallelism on Clusters via Adrián Castelló, Rafael Mayo, Judit Planas, Enrique S. Quintana-Ortí RePara 2015, August Helsinki, Finland Exploiting Task-Parallelism on Clusters via Power/energy/utilization
More informationHPC-CINECA infrastructure: The New Marconi System. HPC methods for Computational Fluid Dynamics and Astrophysics Giorgio Amati,
HPC-CINECA infrastructure: The New Marconi System HPC methods for Computational Fluid Dynamics and Astrophysics Giorgio Amati, g.amati@cineca.it Agenda 1. New Marconi system Roadmap Some performance info
More informationReal Application Performance and Beyond
Real Application Performance and Beyond Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400 Fax: 408-970-3403 http://www.mellanox.com Scientists, engineers and analysts
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationHow HPC Hardware and Software are Evolving Towards Exascale
How HPC Hardware and Software are Evolving Towards Exascale Kathy Yelick Associate Laboratory Director and NERSC Director Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley NERSC Overview
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationEvaluation of Intel Memory Drive Technology Performance for Scientific Applications
Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Vladimir Mironov, Andrey Kudryavtsev, Yuri Alexeev, Alexander Moskovsky, Igor Kulikov, and Igor Chernykh Introducing
More informationTuning Alya with READEX for Energy-Efficiency
Tuning Alya with READEX for Energy-Efficiency Venkatesh Kannan 1, Ricard Borrell 2, Myles Doyle 1, Guillaume Houzeaux 2 1 Irish Centre for High-End Computing (ICHEC) 2 Barcelona Supercomputing Centre (BSC)
More informationGPU COMPUTING WITH MSC NASTRAN 2013
SESSION TITLE WILL BE COMPLETED BY MSC SOFTWARE GPU COMPUTING WITH MSC NASTRAN 2013 Srinivas Kodiyalam, NVIDIA, Santa Clara, USA THEME Accelerated computing with GPUs SUMMARY Current trends in HPC (High
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationAerodynamics of a hi-performance vehicle: a parallel computing application inside the Hi-ZEV project
Workshop HPC enabling of OpenFOAM for CFD applications Aerodynamics of a hi-performance vehicle: a parallel computing application inside the Hi-ZEV project A. De Maio (1), V. Krastev (2), P. Lanucara (3),
More informationNumerical Algorithms on Multi-GPU Architectures
Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationLecture 15: More Iterative Ideas
Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationPiz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design
Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design Sadaf Alam & Thomas Schulthess CSCS & ETHzürich CUG 2014 * Timelines & releases are not precise Top 500
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationTowards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers
Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More information