Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU
Martin Weigel, Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany
11th International NTZ-Workshop on New Developments in Computational Physics, Leipzig, November 25-27, 2010

GPU computation frameworks

GPGPU = general-purpose computation on graphics processing units.

Old times: use the original graphics primitives
- OpenGL
- DirectX

Vendor-specific APIs for GPGPU:
- NVIDIA CUDA: a library of functions performing computations on the GPU (C, C++, Fortran), plus a preprocessor/compiler with language extensions
- ATI/AMD Stream: similar functionality for ATI GPUs

Device-independent schemes:
- BrookGPU (Stanford University): compiler for the Brook stream programming language with backends for different hardware; now merged with AMD Stream
- Sh (University of Waterloo): metaprogramming language for programmable GPUs
- OpenCL (Open Computing Language): open framework for parallel programming across a wide range of devices, from CPUs, Cell processors and GPUs to handheld devices

NVIDIA architecture

Figure: hardware layout of the device (GPU). The device consists of multiprocessors 1 ... n; each multiprocessor contains processors 1 ... m with their own registers (REG), a common instruction unit, on-chip shared memory, and constant and texture caches. All multiprocessors access the off-chip device memory (global memory); host memory sits outside the device.

NVIDIA architecture

Figure: programming model. The host (CPU) launches kernels on the device (GPU). Kernel 1 executes as Grid 1, Kernel 2 as Grid 2; each grid consists of thread blocks, e.g. Block (0,0) ... Block (2,1), and each block consists of individual threads, e.g. Thread (0,0) ... Thread (1,1).
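
As a minimal illustration of this execution model (not part of the talk), the following CUDA fragment launches a kernel on a grid of blocks and lets each thread compute its global index; the array size, block size and kernel name are arbitrary choices for the example.

    #include <cuda_runtime.h>

    // Each thread handles one array element; its position follows from the
    // block index within the grid and the thread index within the block.
    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard: grid may overshoot n
            x[i] *= a;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));

        dim3 block(256);                           // threads per block
        dim3 grid((n + block.x - 1) / block.x);    // blocks per grid
        scale<<<grid, block>>>(d_x, 2.0f, n);      // asynchronous kernel launch
        cudaDeviceSynchronize();

        cudaFree(d_x);
        return 0;
    }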

NVIDIA architecture

Memory layout:
- Registers: each multiprocessor is equipped with several thousand registers with local, zero-latency access.
- Shared memory: the processors of a multiprocessor have access to a small amount (16 KB for Tesla, 48 KB for Fermi) of on-chip, very low-latency shared memory.
- Global memory: a large amount (currently up to 4 GB) of memory on separate DRAM chips, accessible from every thread on each multiprocessor with a latency of several hundred clock cycles.
- Constant and texture memory: read-only memories of the same speed as global memory, but cached.
- Host memory: cannot be accessed from inside GPU functions; transfers are relatively slow.
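
These memory spaces map directly onto CUDA language qualifiers. The sketch below (illustrative names and sizes, not from the talk) touches each of them: a register variable, a per-block shared array, a cached constant, global arrays allocated on the device, and a final copy back to host memory.

    #include <cuda_runtime.h>

    __constant__ float c_beta;             // constant memory: read-only, cached

    __global__ void demo(const float *g_in, float *g_out, int n)
    {
        __shared__ float s_buf[256];       // shared memory: on-chip, per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = 0.0f;                    // 'v' lives in a register

        if (i < n)
            v = g_in[i];                   // global memory: large, high latency
        s_buf[threadIdx.x] = v;            // stage the value in shared memory
        __syncthreads();                   // make it visible to the whole block

        if (i < n)
            g_out[i] = c_beta * s_buf[threadIdx.x];
    }

    int main()
    {
        const int n = 1024;
        float beta = 0.44f, *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemcpyToSymbol(c_beta, &beta, sizeof(float));   // host -> constant memory

        demo<<<n / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        // Host memory is not visible inside kernels; results must be copied back.
        float *h_out = new float[n];
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

        delete[] h_out;
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }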

Spin models

Consider classical spin models with nearest-neighbour interactions, in particular:

Ising model
$H = -J \sum_{\langle ij\rangle} s_i s_j + H \sum_i s_i, \qquad s_i = \pm 1$

Heisenberg model
$H = -J \sum_{\langle ij\rangle} \vec{S}_i \cdot \vec{S}_j + \vec{H} \cdot \sum_i \vec{S}_i, \qquad |\vec{S}_i| = 1$

Edwards-Anderson spin glass
$H = -\sum_{\langle ij\rangle} J_{ij}\, s_i s_j, \qquad s_i = \pm 1$
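
For the Ising case, the quantity needed in a local update is the energy change of a single spin flip. A small helper in the sign conventions written above (plain C/CUDA, an illustrative sketch rather than code from the talk; the field argument Hfield and the row-major layout are assumptions) reads:

    // Energy change dE for flipping spin (x, y) of a 2D Ising model
    // H = -J sum_<ij> s_i s_j + H sum_i s_i  with periodic boundaries.
    __host__ __device__ inline float ising_delta_e(const int *s, int x, int y,
                                                   int L, float J, float Hfield)
    {
        // sum over the four nearest neighbours
        int nb = s[y * L + (x + 1) % L] + s[y * L + (x + L - 1) % L]
               + s[((y + 1) % L) * L + x] + s[((y + L - 1) % L) * L + x];
        int si = s[y * L + x];
        // dE = E(after flip) - E(before) = 2 s_i (J * nb - Hfield)
        return 2.0f * si * (J * nb - Hfield);
    }

A Metropolis step then accepts the flip if dE <= 0, or otherwise with probability exp(-beta * dE).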

Metropolis simulations

Computations need to be organized to suit the GPU layout for maximum performance:
- a large degree of locality of the calculations, reducing the need for communication between threads
- a large coherence of the calculations, with a minimum of divergence between the execution paths of different threads
- a total number of threads significantly exceeding the number of available processing units
- a large overhead of arithmetic operations and shared-memory accesses over global-memory accesses

Consequences for (Metropolis) simulations:
- best to use an independent RNG per thread; need to make sure that the sequences are uncorrelated (a minimal per-thread generator is sketched below)
- divide the system into independent tiles (level-1 checkerboard); each tile should fit into shared memory
- divide each tile (again) in checkerboard fashion for parallel updates by different threads (level-2 checkerboard)
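
A per-thread RNG can be as simple as a 32-bit LCG whose state lives in a per-thread slot of a global array. The constants below are the common Numerical Recipes pair and merely stand in for whatever generator one actually trusts; the question of correlated sequences is taken up again later in the talk.

    // One 32-bit LCG state per thread, stored in global memory between kernel calls.
    __device__ inline float lcg32_uniform(unsigned int *state)
    {
        *state = 1664525u * (*state) + 1013904223u;   // x_{n+1} = a x_n + c  (mod 2^32)
        return (*state) * 2.3283064365e-10f;          // map to [0,1): multiply by 2^-32
    }

    __global__ void kernel_using_rng(unsigned int *rng_states /* one entry per thread */)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int s = rng_states[tid];   // load the private state into a register
        float r = lcg32_uniform(&s);        // ... use r in the Metropolis test ...
        (void)r;
        rng_states[tid] = s;                // write the advanced state back
    }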

Checkerboard decomposition

Figure: double checkerboard decomposition of the lattice.
- large tiles: thread blocks
- small tiles: individual threads
- load one large tile (plus its boundary) into shared memory
- perform several spin updates per tile (see the kernel sketch below)
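
The following kernel is a minimal reconstruction of the double-checkerboard scheme described on this slide, not the author's actual code: the lattice size L, tile size TILE, sweep count NSWEEP and the inline LCG are illustrative choices, spins are stored as int (+1/-1) for clarity rather than speed, and measurement code is omitted. Tiles of one level-1 colour are updated per kernel call, so their halos (owned by tiles of the other colour) stay fixed while several sweeps are performed in shared memory.

    #define L      1024          // linear lattice size (assumption)
    #define TILE   16            // level-1 tile edge; (TILE+2)^2 ints fit easily in shared memory
    #define NSWEEP 100           // sweeps per tile between halo refreshes

    __global__ void metropolis_tile(int *spins, unsigned int *rng, float beta, int color)
    {
        // Level-1 checkerboard: only tiles whose parity matches 'color' are updated.
        int tx = blockIdx.x, ty = blockIdx.y;
        if (((tx + ty) & 1) != color) return;    // simple but wasteful: half the blocks exit

        __shared__ int tile[TILE + 2][TILE + 2]; // interior plus one-site halo

        // Cooperative load of the (TILE+2)^2 sub-lattice with periodic boundaries.
        int tid  = threadIdx.y * blockDim.x + threadIdx.x;
        int nthr = blockDim.x * blockDim.y;
        int x0 = tx * TILE, y0 = ty * TILE;
        for (int i = tid; i < (TILE + 2) * (TILE + 2); i += nthr) {
            int sx = i % (TILE + 2), sy = i / (TILE + 2);
            int gx = (x0 + sx - 1 + L) % L;
            int gy = (y0 + sy - 1 + L) % L;
            tile[sy][sx] = spins[gy * L + gx];
        }
        __syncthreads();

        int lx = threadIdx.x + 1, ly = threadIdx.y + 1;   // this thread's site
        unsigned int state = rng[(blockIdx.y * gridDim.x + blockIdx.x) * nthr + tid];

        for (int sweep = 0; sweep < NSWEEP; ++sweep) {
            for (int par = 0; par < 2; ++par) {           // level-2 checkerboard
                if (((threadIdx.x + threadIdx.y) & 1) == par) {
                    int s  = tile[ly][lx];
                    int nb = tile[ly][lx - 1] + tile[ly][lx + 1]
                           + tile[ly - 1][lx] + tile[ly + 1][lx];
                    float dE = 2.0f * s * nb;             // J = 1, H = 0
                    state = 1664525u * state + 1013904223u;   // LCG32 step
                    float r = state * 2.3283064365e-10f;      // uniform in [0,1)
                    if (dE <= 0.0f || r < expf(-beta * dE))
                        tile[ly][lx] = -s;
                }
                __syncthreads();
            }
        }

        // Write the updated interior back to global memory.
        spins[(y0 + threadIdx.y) * L + x0 + threadIdx.x] = tile[ly][lx];
        rng[(blockIdx.y * gridDim.x + blockIdx.x) * nthr + tid] = state;
    }

    // Host side: one call per level-1 colour; d_rng holds one seeded 32-bit state per site.
    //   dim3 grid(L / TILE, L / TILE), block(TILE, TILE);
    //   metropolis_tile<<<grid, block>>>(d_spins, d_rng, beta, 0);
    //   metropolis_tile<<<grid, block>>>(d_spins, d_rng, beta, 1);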

Performance

How to assess performance?
- What should one compare to (a single CPU core, a whole CPU, an SMP system, ...)? Here: Tesla C1060 vs. Intel QuadCore (Yorkfield) @ 3.0 GHz / 6 MB cache.
- For a really fair comparison: optimize the CPU code for cache alignment, use SSE instructions, etc.
- Measurements are excluded from the timings, since spin flips per µs (or ns, ps) is the well-established performance unit for spin systems.

Example: Metropolis simulation of the 2D Ising model
- use a 32-bit linear congruential generator
- no neighbour table, since integer multiplies and adds are very cheap (4 operations per clock cycle and processor)
- need to play with tile sizes to achieve the best throughput

2D Ising ferromagnet

Figure: time per spin flip t_flip [ns] versus linear system size L = 16 ... 16384, for T = 4, 8, 16, 32.

Figure: t_flip [ns] versus L = 32 ... 16384, comparing the CPU with the GPU for k = 1 and k = 100, and with a Fermi GPU at k = 100.

Figure: t_flip [ns] versus L = 32 ... 16384 for different random number generators: Fibonacci, LCG64, LCG32 with measurements, and LCG32.

A closer look

Comparison to exact results. Figure: specific heat c_V versus inverse temperature β = 0.2 ... 0.7, simulation data compared with the exact curve.

A closer look

Random number generators: significant deviations from the exact result for a test case of a 1024 x 1024 system at β = 0.4 with 10^7 sweeps.
- the checkerboard update uses the random numbers in a different way than the sequential update
- linear congruential generators can skip ahead (see the sketch below):
  - the right way uses non-overlapping sub-sequences
  - the wrong way uses sequences started from random initial seeds, many of which must overlap

method                     e              dev. [σ]   C_V            dev. [σ]
exact                      1.106079207    0          0.8616983594   0
sequential update (CPU)
  LCG32                    1.1060788(15)  0.26       0.83286(45)    63.45
  LCG64                    1.1060801(17)  0.49       0.86102(60)    1.14
  Fibonacci, r = 512       1.1060789(17)  0.18       0.86132(59)    0.64
checkerboard update (GPU)
  LCG32                    1.0944121(14)  8259.05    0.80316(48)    121.05
  LCG32, random            1.1060775(18)  0.97       0.86175(56)    0.09
  LCG64                    1.1061058(19)  13.72      0.86179(67)    0.14
  LCG64, random            1.1060803(18)  0.62       0.86215(63)    0.71
  Fibonacci, r = 512       1.1060890(15)  6.43       0.86099(66)    1.09
  Fibonacci, r = 1279      1.1060800(19)  0.40       0.86084(53)    1.64

(deviations from the exact values are given in units of the quoted standard errors)
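
To give each thread a provably non-overlapping block of the LCG sequence, one can jump the generator ahead by an arbitrary number of steps in O(log k) work by composing the affine map x -> a x + c with itself via repeated squaring. The host-side sketch below is illustrative: the modulus 2^32 comes for free from unsigned overflow, and the constants, stride and seed are assumptions, not the talk's actual parameters.

    #include <stdio.h>

    // Parameters (A, C) of the k-fold iterated map, i.e. x_{n+k} = A x_n + C (mod 2^32).
    static void lcg32_skip(unsigned int a, unsigned int c, unsigned long long k,
                           unsigned int *A_out, unsigned int *C_out)
    {
        unsigned int A = 1u, C = 0u;     // accumulated map: starts as the identity
        unsigned int Ap = a, Cp = c;     // current power of the one-step map
        while (k > 0) {
            if (k & 1ULL) {              // fold this power into the accumulated map
                C = Ap * C + Cp;
                A = Ap * A;
            }
            Cp = Ap * Cp + Cp;           // square the current map: f -> f o f
            Ap = Ap * Ap;
            k >>= 1;
        }
        *A_out = A;
        *C_out = C;
    }

    int main(void)
    {
        const unsigned int a = 1664525u, c = 1013904223u;
        const unsigned long long stride = 1ULL << 20;   // numbers reserved per thread
        unsigned int seed = 12345u;

        // Thread t starts its private sub-sequence at position t * stride of the
        // single global LCG stream, so the sub-sequences cannot overlap.
        for (unsigned int t = 0; t < 4; ++t) {
            unsigned int A, C;
            lcg32_skip(a, c, (unsigned long long)t * stride, &A, &C);
            printf("thread %u starts from state %u\n", t, A * seed + C);
        }
        return 0;
    }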

A closer look

Speedups:
- In two dimensions: 0.076 ns per spin flip on the GPU vs. 8 ns on the CPU, a factor of 105; 0.034 ns on a Fermi GPU, a factor of 235.
- Our CPU code is up to 10 times faster, and our GPU code up to 9 times faster, than the codes used in T. Preis, P. Virnau, W. Paul, and J. J. Schneider, J. Comput. Phys. 228, 4468 (2009).
- In three dimensions: 0.13 ns on the GPU vs. 14 ns on the CPU, a factor of 110; 0.067 ns on a Fermi GPU, a factor of 210.

Heisenberg model

Maximum performance is around 100 ps per spin flip for the Ising model (vs. around 10 ns on the CPU). What about continuous spins, i.e., float instead of int variables?

Use the same decomposition, but now floating-point computations dominate:
- CUDA is not 100% IEEE compliant
- single-precision computations are supposed to be fast; double precision (only recently supported) is much slower
- for single precision, both normal ("high precision") and extra-fast, device-specific versions of sin, cos, exp, etc. are provided (see the snippet below)
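
The difference between the accurate and the fast device functions is visible directly in the source. The fragment below is an illustration, not code from the talk: it computes the Metropolis acceptance factor with expf and with the intrinsic __expf, and draws a random unit vector (as needed for Heisenberg updates) using the fast combined __sincosf; compiling with nvcc --use_fast_math would substitute such intrinsics automatically.

    // Metropolis acceptance probability: accurate vs. fast single precision.
    __device__ float boltzmann(float beta, float dE)      { return expf(-beta * dE); }
    __device__ float boltzmann_fast(float beta, float dE) { return __expf(-beta * dE); }

    // Random unit vector on the sphere from two uniforms u, v in [0,1).
    __device__ void unit_vector_fast(float u, float v, float *x, float *y, float *z)
    {
        float phi      = 6.2831853f * u;             // 2*pi*u
        float costheta = 2.0f * v - 1.0f;
        float sintheta = sqrtf(1.0f - costheta * costheta);
        float s, c;
        __sincosf(phi, &s, &c);                      // fast hardware sin/cos
        *x = sintheta * c;
        *y = sintheta * s;
        *z = costheta;
    }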

Heisenberg model: performance

Figure: time per spin flip [ns] versus L = 16 ... 1024, comparing the CPU (single precision) with the GPU for 5 sweeps and 100 sweeps.

Figure: time per spin flip [ns] versus L, comparing the CPU (single precision) with the GPU using float arithmetic at high and at low precision.

Figure: time per spin flip [ns] versus L for CPU double, CPU single, GPU double, GPU float (high precision) and GPU float (low precision).

Figure: speedup versus L = 16 ... 1024 for float (low precision), float (high precision) and double.

Heisenberg model: stability

Performance results:
- CPU: 185 ns (single) resp. 264 ns (double) per spin flip
- GPU: 0.8 ns (single), 0.4 ns (fast single) resp. 5.3 ns (double) per spin flip

How about stability?

Figure: deviation of the spin vectors from unit length, | |S_i| - 1 |, versus L = 16 ... 1024, ranging from about 10^-12 to 10^-6, for float (low precision), CPU single and float (high precision).

Cluster algorithms

Would need to use cluster algorithms for efficient equilibrium simulation of spin models at criticality:
1. Activate bonds between like spins with probability $p = 1 - e^{-2\beta J}$ (a kernel sketch follows below).
2. Construct (Swendsen-Wang) spin clusters from the domains connected by active bonds.
3. Flip independent clusters with probability 1/2.
4. Go to 1.
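
Step 1 is embarrassingly parallel. A bond-activation kernel for the 2D Ising case could look as follows; this is an illustrative sketch with the same per-thread LCG caveats as before, with one thread handling the right and down bonds of its site and the precomputed probability p_add = 1 - exp(-2*beta*J) passed in from the host.

    // Activate the bonds of a 2D Ising configuration for a Swendsen-Wang update.
    // bond_r / bond_d: one byte per site for the bond to the right / down neighbour.
    __global__ void activate_bonds(const int *s, unsigned char *bond_r,
                                   unsigned char *bond_d, unsigned int *rng,
                                   float p_add, int L)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= L || y >= L) return;

        int i = y * L + x;
        unsigned int state = rng[i];

        // A bond is a candidate only between like spins; activate it with probability p_add.
        state = 1664525u * state + 1013904223u;
        float r1 = state * 2.3283064365e-10f;
        bond_r[i] = (s[i] == s[y * L + (x + 1) % L]) && (r1 < p_add);

        state = 1664525u * state + 1013904223u;
        float r2 = state * 2.3283064365e-10f;
        bond_d[i] = (s[i] == s[((y + 1) % L) * L + x]) && (r2 < p_add);

        rng[i] = state;
    }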

Critical configuration

Figure: snapshot of a critical spin configuration.

Cluster algorithms

Steps 1 and 3 of the Swendsen-Wang update are local and can be efficiently ported to the GPU. What about step 2, the cluster construction? Use a domain decomposition into tiles:
- labeling inside the domains: Hoshen-Kopelman, breadth-first search, self-labeling, union-find algorithms
- relabeling across domains: self-labeling, hierarchical approach, iterative relaxation

BFS, or "ants in the labyrinth"

Figure: 8 x 8 lattice with sites numbered 0 ... 63; a breadth-first search spreads the label 19 step by step across the cluster containing site 19.

- only a wave-front vectorization would be possible
- many idle threads

Self-labeling

Figure: the same 8 x 8 lattice; each site repeatedly replaces its label by the minimum label in its connected neighbourhood (here the label 19 spreading), until nothing changes.

- the effort is O(L^3) at the critical point, but it can be vectorized with O(L^2) threads

Union-find

Figure: the same 8 x 8 lattice with the sites of each cluster organized into trees; each site stores the index of its parent (labels such as 41, 32, 30, 19, 13), and the root represents the cluster.

- tree structure with two optimizations: balanced trees and path compression
- root finding and cluster union become essentially O(1) operations
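
In code, the two optimizations amount to only a few lines. The array-based sketch below is illustrative (plain C, usable on the host or, with minor changes, per thread block on the device) and not the talk's actual implementation: the cluster size is kept at the root to balance the trees, and paths are compressed during every find.

    // parent[i] : index of i's parent site, or i itself if i is a cluster root
    // size[i]   : number of sites in the cluster, valid only at the root

    // Find the root of i's cluster, flattening the path on the way up.
    static int find_root(int *parent, int i)
    {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];   // path compression (halving)
            i = parent[i];
        }
        return i;
    }

    // Join the clusters containing i and j; the smaller tree is hung below
    // the larger one ("balanced trees"), keeping the trees shallow.
    static void cluster_union(int *parent, int *size, int i, int j)
    {
        int ri = find_root(parent, i);
        int rj = find_root(parent, j);
        if (ri == rj) return;                // already in the same cluster
        if (size[ri] < size[rj]) { int t = ri; ri = rj; rj = t; }
        parent[rj] = ri;
        size[ri]  += size[rj];
    }

    // Initialisation: every site is its own cluster of size 1, e.g.
    //   for (int i = 0; i < N; ++i) { parent[i] = i; size[i] = 1; }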

Performance

Figure: ns per spin flip versus L = 32 ... 2048 for the variants "self-labeling, global", "union-find, hierarchical", "union-find, relaxation" and "self-labeling, relaxation", compared to the CPU.

Figure: breakdown of the GPU time per spin flip versus L into the components bond setup, local labeling, boundary relabeling, relaxation and spin flipping.

Figure: the same breakdown compared with the CPU time per spin flip.

Performance

Problems with cluster labeling on the GPU:
- overhead from the parallelization (relaxation steps)
- lack of thread-level parallelism
- idle threads in the hierarchical schemes
- best performance is about 29 ns per spin flip; improvements are possible
- the problems are not due to the type of computations involved: Swendsen-Wang simulations of several systems in parallel reach 2.9 ns per spin flip

Spin glasses

Simulate the Edwards-Anderson model on the GPU:
- same domain decomposition (checkerboard)
- slightly bigger effort due to the non-constant couplings
- higher performance due to larger independence?
- very simple to combine with parallel tempering (see the sketch below)
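
The parallel-tempering part can stay on the host: after every few sweeps one proposes swaps between neighbouring temperatures, which only requires the current energies of the replicas. The sketch below is illustrative, not the talk's code; the energies E would come from a reduction over the GPU configurations, and drand48 merely stands in for the host RNG.

    #include <math.h>
    #include <stdlib.h>

    // Propose swaps between neighbouring temperatures of a parallel-tempering chain.
    // beta[r] is the inverse temperature of slot r, perm[r] the replica currently in
    // that slot, and E[r] its current energy (e.g. obtained from a GPU reduction).
    static void pt_swap_sweep(int n_replica, const double *beta, double *E, int *perm)
    {
        for (int r = 0; r + 1 < n_replica; ++r) {
            double delta = (beta[r + 1] - beta[r]) * (E[r + 1] - E[r]);
            // Acceptance min(1, exp(delta)); an accepted swap exchanges the
            // configurations (here: the permutation entries) between the two slots.
            if (delta >= 0.0 || drand48() < exp(delta)) {
                int    tp = perm[r]; perm[r] = perm[r + 1]; perm[r + 1] = tp;
                double tE = E[r];    E[r]    = E[r + 1];    E[r + 1]    = tE;
            }
        }
    }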

Spin glass: performance

Figure: time per spin flip [ns] (10^-2 ... 10^2) versus the number of replicas (1 ... 10000) for CPU and GPU.

Figure: the corresponding speedup versus the number of replicas.

Spin glasses: continued

This seems to work well, with
- 15 ns per spin flip on the CPU
- 180 ps per spin flip on the GPU
but the speedup is not better than for the ferromagnetic Ising model.

Further improvement: use multi-spin coding.
- Synchronous multi-spin coding: different spins of a single configuration in one word.
- Asynchronous multi-spin coding: spins from different realizations in one word (sketched below).

This brings us down to about 15 ps per spin flip.
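
To make the bit-parallel update concrete, here is a minimal sketch of the asynchronous variant for the simplest case, a 2D Ising ferromagnet with 32 independent replicas sharing one 32-bit word per site; the spin-glass version would additionally XOR in bit-coded coupling signs. The bit-sliced count of antiparallel neighbours and the single random number shared per word are standard multi-spin-coding tricks, not the author's actual code, and the checkerboard update order of the earlier slides carries over unchanged.

    // Asynchronous multi-spin coding: bit b of each word holds the spin (0/1 for -1/+1)
    // of replica b at that site.  One Metropolis update of one site for all 32 replicas.
    __device__ unsigned int msc_metropolis_site(unsigned int s,        // spins at the site
                                                unsigned int n1, unsigned int n2,
                                                unsigned int n3, unsigned int n4,
                                                float r,               // one uniform per word
                                                float exp4, float exp8) // exp(-4*beta), exp(-8*beta)
    {
        // Per bit: a/b/c/d = 1 where the corresponding neighbour is antiparallel.
        unsigned int a = s ^ n1, b = s ^ n2, c = s ^ n3, d = s ^ n4;

        // Bit-sliced sum k = a+b+c+d (0..4) per bit position, via full-adder logic.
        unsigned int s1 = a ^ b, c1 = a & b;
        unsigned int s2 = c ^ d, c2 = c & d;
        unsigned int ones  = s1 ^ s2;                             // bit 0 of k
        unsigned int c3    = s1 & s2;
        unsigned int twos  = c1 ^ c2 ^ c3;                        // bit 1 of k
        unsigned int fours = (c1 & c2) | (c1 & c3) | (c2 & c3);   // bit 2 of k (k == 4)

        // Energy change of a flip: dE = 8 - 4k, so dE <= 0 iff k >= 2,
        // dE = 4 for k == 1 and dE = 8 for k == 0.
        unsigned int k_ge2 = twos | fours;
        unsigned int k_eq1 = ones  & ~k_ge2;
        unsigned int k_eq0 = ~ones & ~k_ge2;

        // One random number is shared by all 32 replicas of this word (a standard
        // asynchronous-MSC shortcut that slightly correlates the replicas).
        unsigned int m4 = (r < exp4) ? 0xffffffffu : 0u;   // accept dE = 4 flips
        unsigned int m8 = (r < exp8) ? 0xffffffffu : 0u;   // accept dE = 8 flips
        unsigned int accept = k_ge2 | (k_eq1 & m4) | (k_eq0 & m8);

        return s ^ accept;                                 // flip the accepted bits
    }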

Janus

JANUS is a modular, massively parallel and reconfigurable FPGA-based computing system.

Costs:
- Janus: 256 units, total cost about 700,000 Euros.
- The same performance with GPUs: 64 PCs (2,000 Euros each) with 2 GTX 295 cards (500 Euros each), about 200,000 Euros.
- The same performance with CPUs only (assuming a speedup of 50): 800 blade servers with two dual quad-core sub-units (3,500 Euros each), about 2,800,000 Euros.

Outlook

Conclusions:
- GPGPU promises significant speedups at moderate coding effort.
- Requirements for good performance:
  - a large degree of locality (domain decomposition)
  - suitability for parallelization (blocks) and vectorization (threads)
  - a total number of threads much larger than the number of processing units (to hide memory latency)
  - opportunities for using shared memory, since performance is memory limited
  - ideally continuous variables
- Maximum speedup of about 500.
- The effort is significantly smaller than for special-purpose machines.
- GPGPU might be a fashion, but CPU computing is going the same way.

References:
- M. Weigel, Simulating spin models on GPU, Comput. Phys. Commun. (2010), in print, preprint arXiv:1006.3865.
- M. Weigel, Performance potential for simulating spin models on GPU, Mainz preprint (2010).
- Code at http://www.cond-mat.physik.uni-mainz.de/~weigel/gpu.

GPU workshop: late May / early June 2011, Mainz, Germany.