Multicore-aware parallelization strategies for efficient temporal blocking (BMBF project: SKALB)
|
|
- Leon Murphy
- 5 years ago
- Views:
Transcription
1 Multicore-aware parallelization strategies for efficient temporal blocking (BMBF project: SKALB) G. Wellein, G. Hager, M. Wittmann, J. Habich, J. Treibig Department für Informatik H Services, Regionales Rechenzentrum Erlangen Friedrich-Alexander-Universität Erlangen-Nürnberg
2 SKALB project survey artners: TU Braunschweig TU Dortmund / IANUS HLRS Stuttgart Erlangen: LSS / RRZE (projectlead) Ass. partners: Intel, RAY, BASF, Sulzer, hhpberlin, H Goal: Efficient implementation of Lattice-Boltzmann FD solvers for complex multi-physics applications on petascale supercomputers Main tasks of RRZE: Multicore-specific implementation & optimization of kernels & full applications. Static domain decomposition Benchmarking and showcases (BMBF all H-Software für skalierbare arallelrechner ) Funding: 1,8 M 22. Juni 2010 SKALB Multicore-aware parallelization 2
3 Multicore-aware parallelization approach: Efficient, parallel, multicore-aware temporal blocking roof of concept: Simple iterative method w. regular stencil updates (Jacobi) Elaborate spatial blocking techniques have been implemented (not shown) Socket-/node level implementation + multi-halo data exchange for multi-node computations 1. M. Wittmann, G. Hager, J. Treibig and G. Wellein: Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters. Submitted to arallel rocessing Letters. arxiv: J. Treibig, G. Wellein and G. Hager: Efficient multicore-aware parallelization strategies for iterative stencil computations. Submitted to Journal of omputational Science. arxiv: M. Wittmann, G. Hager and G. Wellein: Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory. Accepted for roceedings of LS10, the Workshop on Large- Scale arallel rocessing at IDS 2010, April 23rd, 2010, Atlanta, GA. arxiv: G. Wellein, G. Hager, T. Zeiser, M. Wittmann, H. Fehske: Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. roceedings of rd Annual IEEE International omputer Software and Applications onference (OMSA2009), IEEE omputer Society, p DOI /OMSA BEST AER AWARD 22. Juni 2010 SKALB Multicore-aware parallelization 3
4 The multicore (r)evolution A standard compute node 2-socket shared-memory node of cache-coherent Non-Uniform Memory Access (ccnuma) type Shared aches: MI L3 AHE Memory L3 AHE MI Memory Data exchange Thread synchronization Data oherency! L3 cache traffic? Scalable? MI parallel? AMD Opteron Istanbul 6 cores@2.8 GHz L1: 64 KB / L2: 512 KB / L3: 6 MB 2 X 2 X DDR GB/s HT GB/s/dir Intel Xeon Westmere 6 cores@2.93 GHz L1: 32 KB; L2: 256 KB; L3: 12MB 2 X 3 X DDR GB/s 2 X QI GB/s/dir 22. Juni 2010 SKALB Multicore-aware parallelization 4
5 Multicore-aware parallelization Jacobi solver a simple prototype do k = 1, N do j = 1, N do i = 1, N y(i,j,k) = b* (x(i-1,j,k)+x(i+1,j,k)+ x(i,j-1,k)+x(i,j+1,k)+ x(i,j,k-1)+x(i,j,k+1)) enddo enddo enddo A simple but representative test erformance Metric: Million Lattice Site Updates per second (MLUs) Equivalent MFLOs: 6 FLO/LU * MLUs Bandwidth requirements: 16 Byte / Lattice Site Update (LU) if write allocate on y is surpressed (24 Byte/LU otherwise) erformance estimate for best baseline: B M / (16 Byte/LU) (B M : attainable memory bandwidth as measured with stream) 22. Juni 2010 SKALB Multicore-aware parallelization 5
6 Multicore-aware parallelization Towards multicore-awareness: Naive / calssical approach core0 core1 ache Memory x j-direction k-direction do t=1,t Max!$OM ARALLEL DO private( ) do k=1,n do j=1,n do i=1,n y(i,j,k) = enddo enddo enddo y x enddo 22. Juni 2010 SKALB Multicore-aware parallelization 6
7 Multicore-aware parallelization Reuse of in-cache data between threads: wavefronts Reduce main memory accesses by a factor 2! y(:,:,:) is obsolete! core0 core1 Save main memory data transfers for y(:,:,:)! y-direction z-direction tmp(:,:,3) Memory x(:,:,:) Use ring buffer tmp(:,:, 0:3) which fits into the cache Sync threads/cores after each k-iteration core0: x(:,:,k-1:k+1) t tmp(:,:,mod(k,4)) core1: tmp(:,:,mod(k-3,4):mod(k-1,4)) x(:,:,k-2) t Juni 2010 SKALB Multicore-aware parallelization 7
8 Multicore-aware parallelization Wavefront parallelization: Multiple updates 1 x 4 distribution core0 core1 core2 core3 tmp1(0:3) tmp2(0:3) tmp3(0:3) x( :, :, : ) Running tb wavefronts requires tb-1 temporary arrays tmp to be held in cache! Max. performance gain: tb = 4 But extensive use of cache bandwidth! 22. Juni 2010 SKALB Multicore-aware parallelization 8
9 Multicore-aware parallelization Wavefront Jacobi on state-of-the art multicores B olc ~ 10 MI B olc ~ 2-3 MI B olc ~ 10 ompare against optimal baseline! erformance gain ~ B olc = L3 bandwidth / memory bandwidth MI 22. Juni 2010 SKALB Multicore-aware parallelization 9
10 Multicore-aware programming First lectures on the node level Fast shared on-chip resources: Room for new ideas! Fast thread synchronisation through shared caches Wavefront works fine with Intel compilers but fails with gcc: OM BARRIER: 4-core ore i7: 11,000 cycles (gcc 4.3.3) vs. 800 cycles (icc 11.0) Ability to exploit parallel structures at finer granularity (shorter loops, frequent synchronisation) Technique can easily be extended to many regular stencil based iterative methods, e.g. Gauß-Seidel ( done) Lattice-Boltzmann flow solvers ( work in progress) Multigrid ( BMBF proposal EMMMA) For multi-node applications an hybrid MI+OpenM approach needs to e implemented (next slides) 22. Juni 2010 SKALB Multicore-aware parallelization 10
11 Multicore-aware programming Multi-layer halo exchange Temporal blocking requires multi-layer halo Using diagonal communication elimination (DE) (Ding/He S 2001) Exchanging halo with neighbors done only along the coordinate directions 2 More complex stencils, e.g. occurring at lattice Boltzmann methods, need more attention for deciding which data to communicate Juni 2010 SKALB Multicore-aware parallelization 11
12 Multicore-aware programming Impact of Multi-layer Halo on erformance Assumptions for model: No overlap between communication and computation QDR-InfiniBand 3.2 GB/s 1.8 µs latency Node performance 2 GLU/s Reduced latency by message aggregation strong scaling Degrade due to halo work No impact for large domain sizes 22. Juni 2010 SKALB Multicore-aware parallelization 12
13 Multicore-aware programming Strong Scaling Nehalem E cluster QDR Infiniband Domain size: GB Benefit of temporal blocking eaten by communication overhead (~ 60-70%) ipelining 1N suffers from bulksynchronized communication 22. Juni 2010 SKALB Multicore-aware parallelization 13
14 Multicore-aware programming Weak Scaling Nehalem E luster QDR Infiniband Subdomain size per node: Benefit from temporal blocking can be maintained 22. Juni 2010 SKALB Multicore-aware parallelization 14
15 arallel GU computing with Lattice-Boltzmann solvers a focus of SKALB
16 reliminary comparison Blue-Gene/ GU-luster 96 GUs: NVIDIA Tesla 24 host nodes D3Q13 discretization #cores precision # grid nodes Blue-Gene/ double 10 9 ~20 GLUS Ludwig (96 GUs) single 3.2 x 10 9 ~45 GLUS absolute performance M. Krafczyk slide 16 ih Gauß-Allianz H Status onference, June 22-24,2010
17 WALBERLA: Block-/patch-structured heterogeneous U/GU D3Q19 Halo exchange between blocks Single precision computations GU cluster at HLRS Blocks GU: 1 GU: 22, U:1 Nodes rocesses 2 x GU 60 x GU 2 x GU + 60 x GU + 60 x GU + 60 x GU + 6 x U 180 x U 420 x U 660 x U MFLUS ~3 TFlop/s 22. Juni 2010 SKALB Multicore-aware parallelization 17
18 Besten Dank Financial support through KONWIHR-II projects OMI4papps, BMBF roject SKALB (01 IH08003A) Juni 2010 SKALB Multicore-aware parallelization 18
Performance Engineering - Case study: Jacobi stencil
Performance Engineering - Case study: Jacobi stencil The basics in two dimensions (2D) Layer condition in 2D From 2D to 3D OpenMP parallelization strategies and layer condition in 3D NT stores Prof. Dr.
More informationCommon lore: An OpenMP+MPI hybrid code is never faster than a pure MPI code on the same hybrid hardware, except for obvious cases
Hybrid (i.e. MPI+OpenMP) applications (i.e. programming) on modern (i.e. multi- socket multi-numa-domain multi-core multi-cache multi-whatever) architectures: Things to consider Georg Hager Gerhard Wellein
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationMPI+OpenMP hybrid computing
MPI+OpenMP hybrid computing (on modern multicore systems) Georg Hager Gerhard Wellein Holger Stengel Jan Treibig Markus Wittmann Michael Meier Erlangen Regional Computing Center (RRZE), Germany 39th Speedup
More informationQuantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model
ERLANGEN REGIONAL COMPUTING CENTER Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model Holger Stengel, J. Treibig, G. Hager, G. Wellein Erlangen Regional
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationsimulation framework for piecewise regular grids
WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler
More informationEfficient numerical simulation on multicore processors (MuCoSim)
Efficient numerical simulation on multicore processors (MuCoSim) 13.10.2015 Prof. Gerhard Wellein, Dr. G. Hager Department für Informatik & HPC Services Regionales Rechenzentrum Erlangen (RRZE) http://moodle.rrze.uni-erlangen.de/course/view.php?id=340
More informationProf. Gerhard Wellein, Dr. Georg Hager Erlangen Regional Computing Center. Universität Erlangen-Nürnberg. Compact Course, KTH Stockholm, March 2011
Efficient multithreaded programming on modern Us and GUs Multicore performance issues rof. Gerhard Wellein, Dr. Georg Hager Erlangen Regional omputing enter Department for omputer Science Universität Erlangen-Nürnberg
More informationBasics of performance modeling for numerical applications: Roofline model and beyond
Basics of performance modeling for numerical applications: Roofline model and beyond Georg Hager, Jan Treibig, Gerhard Wellein SPPEXA PhD Seminar RRZE April 30, 2014 Prelude: Scalability 4 the win! Scalability
More informationGeometric Multigrid on Multicore Architectures: Performance-Optimized Complex Diffusion
Geometric Multigrid on Multicore Architectures: Performance-Optimized Complex Diffusion M. Stürmer, H. Köstler, and U. Rüde Lehrstuhl für Systemsimulation Friedrich-Alexander-Universität Erlangen-Nürnberg
More informationNumerical Algorithms on Multi-GPU Architectures
Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications
More informationarxiv: v1 [cs.pf] 10 Apr 2010
Efficient multicore-aware parallelization strategies for iterative stencil computations Jan Treibig, Gerhard Wellein, Georg Hager Erlangen Regional Computing Center (RRZE) University Erlangen-Nuremberg
More informationApplication Performance in the Multi-Core Era. Lessons to Be Learned for Exascale Computing
Application Performance in the Multi-ore Era Lessons to Be Learned for Exascale omputing ( Tales from the trenches ) Gerhard Wellein Department for omputer Science Erlangen Regional omputing enter Friedrich-Alexander-University
More informationIntroducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method
Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System
More informationHow to Optimize Geometric Multigrid Methods on GPUs
How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient
More informationProgramming Techniques for Supercomputers
Programming Techniques for Supercomputers Prof. Dr. G. Wellein (a,b) Dr. G. Hager (a) Dr.-Ing. M. Wittmann (a) (a) HPC Services Regionales Rechenzentrum Erlangen (b) Department für Informatik University
More informationThe ECM (Execution-Cache-Memory) Performance Model
The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop Memory issues on Multi- and Manycore
More informationAnalytical Tool-Supported Modeling of Streaming and Stencil Loops
ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported Modeling of Streaming and Stencil Loops Georg Hager, Julian Hammer Erlangen Regional Computing Center (RRZE) Scalable Tools Workshop August
More informationThe walberla Framework: Multi-physics Simulations on Heterogeneous Parallel Platforms
The walberla Framework: Multi-physics Simulations on Heterogeneous Parallel Platforms Harald Köstler, Uli Rüde (LSS Erlangen, ruede@cs.fau.de) Lehrstuhl für Simulation Universität Erlangen-Nürnberg www10.informatik.uni-erlangen.de
More informationGerald Schubert 1, Georg Hager 2, Holger Fehske 1, Gerhard Wellein 2,3 1
Parallel lsparse matrix-vector ti t multiplication li as a test case for hybrid MPI+OpenMP programming Gerald Schubert 1, Georg Hager 2, Holger Fehske 1, Gerhard Wellein 2,3 1 Institute of Physics, University
More informationPerformance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures
Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures Dirk Ribbrock, Markus Geveler, Dominik Göddeke, Stefan Turek Angewandte Mathematik, Technische Universität Dortmund
More informationMultigrid algorithms on multi-gpu architectures
Multigrid algorithms on multi-gpu architectures H. Köstler European Multi-Grid Conference EMG 2010 Isola d Ischia, Italy 20.9.2010 2 Contents Work @ LSS GPU Architectures and Programming Paradigms Applications
More informationCommunication-Avoiding Optimization of Geometric Multigrid on GPUs
Communication-Avoiding Optimization of Geometric Multigrid on GPUs Amik Singh James Demmel, Ed. Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2012-258
More informationBandwidth Avoiding Stencil Computations
Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationPeta-Scale Simulations with the HPC Software Framework walberla:
Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,
More informationwalberla: Developing a Massively Parallel HPC Framework
walberla: Developing a Massively Parallel HPC Framework SIAM CS&E 2013, Boston February 26, 2013 Florian Schornbaum*, Christian Godenschwager*, Martin Bauer*, Matthias Markl, Ulrich Rüde* *Chair for System
More informationEfficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI
Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationCAMA: Modern processors. Memory hierarchy: Caches. Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center
AMA: Modern processors Memory hierarchy: aches Gerhard Wellein, Department for omputer Science and Erlangen Regional omputing enter Johannes Hofmann/Dietmar Fey, Department for omputer Science University
More informationKommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen
Kommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen Rolf Rabenseifner rabenseifner@hlrs.de Universität Stuttgart, Höchstleistungsrechenzentrum Stuttgart (HLRS)
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationGeneration of Multigrid-based Numerical Solvers for FPGA Accelerators
Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System
More informationarxiv: v2 [cs.dc] 7 Jan 2013
LIKWID: Lightweight Performance Tools Jan Treibig, Georg Hager, and Gerhard Wellein arxiv:1104.4874v2 [cs.dc] 7 Jan 2013 Abstract Exploiting the performance of today s microprocessors requires intimate
More informationProgramming Techniques for Supercomputers: Modern processors. Architecture of the memory hierarchy
Programming Techniques for Supercomputers: Modern processors Architecture of the memory hierarchy Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), Dr. M. Wittmann (a) (a) HPC Services Regionales Rechenzentrum
More informationMulticore Scaling: The ECM Model
Multicore Scaling: The ECM Model Single-core performance prediction The saturation point Stencil code examples: 2D Jacobi in L1 and L2 cache 3D Jacobi in memory 3D long-range stencil G. Hager, J. Treibig,
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More informationarxiv: v1 [cs.dc] 3 Nov 2011
Comparison of different Propagation Steps for the Lattice Boltzmann Method Markus Wittmann a, Thomas Zeiser a, Georg Hager a, Gerhard Wellein b a Erlangen Regional Computing Center (RRZE), University of
More informationCommunication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart
More informationPerformance Evaluation of Current HPC Architectures Using Low-Level Level and Application Benchmarks
Performance Evaluation of urrent HP Architectures Using Low-Level Level and Application Benchmarks G. Hager, H. Stengel, T. Zeiser,, G. Wellein Regionales Rechenzentrum Erlangen (RRZE) Friedrich-Alexander
More informationTowards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers
Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationEfficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationPlatforms Design Challenges with many cores
latforms Design hallenges with many cores Raj Yavatkar, Intel Fellow Director, Systems Technology Lab orporate Technology Group 1 Environmental Trends: ell 2 *Other names and brands may be claimed as the
More informationArchitecture Aware Multigrid
Architecture Aware Multigrid U. Rüde (LSS Erlangen, ruede@cs.fau.de) joint work with D. Ritter, T. Gradl, M. Stürmer, H. Köstler, J. Treibig and many more students Lehrstuhl für Informatik 10 (Systemsimulation)
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationPerformance. Computing (UCHPC)) for Finite Element Simulations
technische universität dortmund Universität Dortmund fakultät für mathematik LS III (IAM) UnConventional High Performance Computing (UCHPC)) for Finite Element Simulations S. Turek, Chr. Becker, S. Buijssen,
More informationOverlapping Computation and Communication for Advection on Hybrid Parallel Computers
Overlapping Computation and Communication for Advection on Hybrid Parallel Computers James B White III (Trey) trey@ucar.edu National Center for Atmospheric Research Jack Dongarra dongarra@eecs.utk.edu
More informationComputing on GPU Clusters
Computing on GPU Clusters Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationAdvances of parallel computing. Kirill Bogachev May 2016
Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being
More informationPerformance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla
Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla SIAM PP 2016, April 13 th 2016 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer,
More informationIdentifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning
Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Yukinori Sato (JAIST / JST CREST) Hiroko Midorikawa (Seikei Univ. / JST CREST) Toshio Endo (TITECH / JST CREST)
More informationarxiv: v1 [cs.pf] 5 Dec 2011
Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results J. Habich a, C. Feichtinger b, H. Köstler b, G. Hager a, G. Wellein a,b a Erlangen
More informationNetwork Bandwidth & Minimum Efficient Problem Size
Network Bandwidth & Minimum Efficient Problem Size Paul R. Woodward Laboratory for Computational Science & Engineering (LCSE), University of Minnesota April 21, 2004 Build 3 virtual computers with Intel
More informationCray XE6 Performance Workshop
Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed
More informationHigh Performance Computing (HPC) Introduction
High Performance Computing (HPC) Introduction Ontario Summer School on High Performance Computing Scott Northrup SciNet HPC Consortium Compute Canada June 25th, 2012 Outline 1 HPC Overview 2 Parallel Computing
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationAutomatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014
Automatic Generation of Algorithms and Data Structures for Geometric Multigrid Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Introduction Multigrid Goal: Solve a partial differential
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMassively Parallel Phase Field Simulations using HPC Framework walberla
Massively Parallel Phase Field Simulations using HPC Framework walberla SIAM CSE 2015, March 15 th 2015 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich
More informationTurbostream: A CFD solver for manycore
Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware
More informationApplication Performance on Dual Processor Cluster Nodes
Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys
More informationDesigning High Performance Communication Middleware with Emerging Multi-core Architectures
Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationParallel sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP programming
arallel sparse matrix-vector multiplication as a test case for hybrid MI+OpenM programming Gerald Schubert 1, Georg Hager 1, Holger Fehske 2, Gerhard Wellein 1,3 1 Erlangen Regional Computing Center, 9158
More informationOncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries
Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big
More informationModern computer architecture. From multicore to petaflops
Modern computer architecture From multicore to petaflops Motivation: Multi-ores where and why Introduction: Moore s law Intel Sandy Brige EP: 2.3 Billion nvidia FERMI: 3 Billion 1965: G. Moore claimed
More informationFirst Experiences with Intel Cluster OpenMP
First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationOptimization of finite-difference kernels on multi-core architectures for seismic applications
Optimization of finite-difference kernels on multi-core architectures for seismic applications V. Etienne 1, T. Tonellot 1, K. Akbudak 2, H. Ltaief 2, S. Kortas 3, T. Malas 4, P. Thierry 4, D. Keyes 2
More informationEfficient implementation of simple lattice Boltzmann kernels
Survey Introduction Efficient implementation of simple lattice Boltzmann kernels Memory hierarchies of modern processors Implementation of lattice Boltzmann method Optimization of data access for 3D lattice
More information4. Shared Memory Parallel Architectures
Master rogram (Laurea Magistrale) in Computer cience and Networking High erformance Computing ystems and Enabling latforms Marco Vanneschi 4. hared Memory arallel Architectures 4.4. Multicore Architectures
More informationCRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar
CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationCAF versus MPI Applicability of Coarray Fortran to a Flow Solver
CAF versus MPI Applicability of Coarray Fortran to a Flow Solver Manuel Hasert, Harald Klimach, Sabine Roller m.hasert@grs-sim.de Applied Supercomputing in Engineering Motivation We develop several CFD
More informationFrom Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp
From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with andreas.schaefer@cs.fau.de Friedrich-Alexander-Universität Erlangen-Nürnberg GPU Technology Conference 2013, San José,
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationMulti-Threaded UPC Runtime for GPU to GPU communication over InfiniBand
Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State
More informationComputational Fluid Dynamics with the Lattice Boltzmann Method KTH SCI, Stockholm
Computational Fluid Dynamics with the Lattice Boltzmann Method KTH SCI, Stockholm March 17 March 21, 2014 Florian Schornbaum, Martin Bauer, Simon Bogner Chair for System Simulation Friedrich-Alexander-Universität
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationLixia Liu, Zhiyuan Li Purdue University, USA. grants ST-HEC , CPA and CPA , and by a Google Fellowship
Lixia Liu, Zhiyuan Li Purdue University, USA PPOPP 2010, January 2009 Work supported in part by NSF through Work supported in part by NSF through grants ST-HEC-0444285, CPA-0702245 and CPA-0811587, and
More informationParallel Processing. Parallel Processing. 4 Optimization Techniques WS 2018/19
Parallel Processing WS 2018/19 Universität Siegen rolanda.dwismuellera@duni-siegena.de Tel.: 0271/740-4050, Büro: H-B 8404 Stand: September 7, 2018 Betriebssysteme / verteilte Systeme Parallel Processing
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationReconstruction of Trees from Laser Scan Data and further Simulation Topics
Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair
More informationThe Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview
More informationEuropean exascale applications workshop, Manchester, 11th and 12th October 2016 DLR TAU-Code - Application in INTERWinE
European exascale applications workshop, Manchester, 11th and 12th October 2016 DLR TAU-Code - Application in INTERWinE Thomas Gerhold, Barbara Brandfass, Jens Jägersküpper, DLR Christian Simmendinger,
More informationLecture 1: Introduction
Lecture 1: Introduction ourse organization: 13 lectures on parallel architectures ~5 lectures on cache coherence, consistency ~3 lectures on TM ~2 lectures on interconnection networks ~2 lectures on large
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:
The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention
More informationChallenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures. Allison H. Baker, Todd Gamblin, Martin Schulz, and Ulrike Meier Yang
Challenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures. Allison H. Baker, Todd Gamblin, Martin Schulz, and Ulrike Meier Yang Multigrid Solvers Method of solving linear equation
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More information