From modelisation house To test center. Emmanuel Quémener

Size: px
Start display at page:

Download "From modelisation house To test center. Emmanuel Quémener"

Transcription

1 Presentation of IT resources provided by Blaise Pascal Center From modelisation house To test center Emmanuel Quémener

2 Blaise Pascal Center «House of Modelisation» &... Conferences Trainings Projects Hotel June, /3

3 Centre Blaise Pascal ~ Dryden FR Small illustrative example Nasa X-9 Cell from F- Engine from F- ears from F-6 Studies Flap wings Incidence > «Fly-By-Wire» To recycle, to reuse and explore from new domains June, 3/3

4 From engines to parallel envelope A necessary exploration? Flight Envelope Parallel Envelope Software Hardware Speed/altitude Parallelization/Memory/CPU/PU Load factor Input/Output/Contexts/... June, 4/3

5 CBP also a Test Center Experimental platform with technical benchs Multi-nodes: clusters from 4 to 64 nodes, Nodes/Cores : 4/4, /64, /64, 64/ Multi-cores : more than machines from to cores, Nodes/Servers: from to cores, Workstations: from to 6 cores, types of CPU différents PU & Accelerator : 46 different models of PU (AMD & Nvidia), Intel MIC PPU : 9 ; PU Nvidia : ; PU AMD/ATI : types ; Xeon Phi P Integration : 4 virtual machines : Debian from Lenny to Sid both 3 & 64 bits,... Exotic hardware : 3 machines ARMv under Debian Jessie or Ubuntu 3D Vizualisation: stations, videoprojectors, monitors, 4 glasses 3D Virtual Desktop : up to 3 hosts with xgo/virtuall COMOD with SIDUS : «Compute On My Own Device» for laboratories users SIDUS is «Single Instance Distributing Universal System» alaxy prototype for intensive processing of biological data June, /3

6 Some «Digital Benchs» permanent or ephemeral! Schools in Les Houches about «computational physics» : «standalone» complete virtual machine SIDUS virtual machine Digital Humanities : eography : virtual machines and already operational ones Vizualisation workstations for graphical processing Biology : alaxy gateway (with cluster nodes processing) Processing and display stations for MorphoraphX Server provisioning to study & process under Repeat* with lusterfs, TMPFS & BRD Emmanuel Quemener June, 6/3

7 To what does it serve? Example : Repeat* lusterfs/tmpfs Progression «Input Database Coverage» Progression «Family Refinement» Less is Better! More is Better!. 3 R RamDisk. R3 lusterfs R Ramdisk R3 lusterfs Accélération %.3%.9%.9% Emmanuel Quemener June,.4% /

8 An example of interactions Between Laboratory/CBP/PSMN ive U t ap P d a nt AMD y e u B i pm u Eq Preproduction Laboratory Evaluation Test center Nvidia Sy Kno st wem H s & ow Ne SID tw US or ks Production Hardware CBP / Sidus OS CBP Compute Center Influence the choice of hardware background Hardware LBMC / Sidus OS CBP Emmanuel Quemener June, Hardware PSMN / Sidus OS PSMN /3

9 How to manage + machines? SIDUS! Single Instance Distributing Universal System June, 9/3

10 Exemples of studies : predictions... Overshoot Amdahl Law with Mylq One From T=T+p/N to T=T+c*N+p/N PiMC : parallel execution with C/MPI for x x Mesures 4x Itops (Iterative Operations Per Second) 3.x 3x.x x.x x x Amdahl & Mylq Laws Mylq Law Fit on range [-] Mylq Law Mylq Law Fit on range [-6] Amdahl Law Mylq Law Fit on range [-] Mylq Law Amdahl Law 4x 3x x x Mylq Law Fit on range [-64] Itops (Iterative Operations Per Second) 4.x distributed iterations Mesures PiMC : parallel execution with C/MPI for distributed iterations Parallel Rate (NP in MPI) Emmanuel Quemener June, Parallel Rate (NP in MPI) /3 3 4

11 From (multi many)-cores to myri-alus? A parallel laboratory on a chip! You got from to dozen of thousands of equivalent PU! You got the choice between : CPU MIC PU How to choose and distinguish them for its use? June, /3

12 A long long time ago, between 9... and, variable is frequency! Power vs RPM x in years & stop (and even a drop) June, /3

13 The "engine", source of "power" Forget frequency, change RPM to PR 4. Horse Power vs RPM Pthreads OpenMP 3. MPI OpenCL Intel 3. OpenCL AMD. Theory What "behaviors" for these engines at "load-bearing"? The "engine", a system entangling hardware, OS and software June, 3/3

14 Why is the PU so powerful? Shadering & Matrix Calculation Model World: 3 matrix products Rotation Translation Scalinge MW World View: matrix products Positioning the camera Location Direction View Projection WV VP Pixelisation So a PU: it's a "big" Matrix Multiplier June, 4/3

15 BLAS comparaison : CPU vs PU xemm when x=d(p) or S(P). OpenBLAS DP 9. clblas DP.. 6. Tesla for Pros Thunking DP TX for amers CuBLAS DP OpenBLAS SP clblas SP AMD/ATI For Alt*. 4. Thunking SP CuBLAS SP 3. CPUs AMD. Intel.. m n Ti Ti Ti Ti 69 Tita K P TX TX X k C M la K s la la TX X X X o a TX T T T T TX s l es la Tes Te Tes dr e a T T u Q June, 9 HD r y # no Fu x Na 9 R - 9 R9 Xe on i Ph 9 9 X F n ze Ry /3 3 x x 3 x 3 x 4 x 4 x E6 66 v v v v x E E E E E

16 What Nvidia never broadcasts... xemm on a large domain... Performance Tesla P TX Size Yes, performance is there, but in specific conditions! June, 6/3

17 Which simple code to implement in //? PiMC: Pi by the method of the target Historical example of the Monte Carlo method Parallel implementation: Equivalent distribution of iterations From to 4 parameters Number of iterations Parallel Rate (PR) (Type of variable: INT3, INT64, FP3, FP64) (RN: MWC, CON, SHR3, KISS) simple observables: Estimation of Pi (just indicative, Pi not being rational :-)) Elapsed Time (Individual or lobal) June, /3

18 With QPU, low "regime"... The CPU explodes the PU! From to x slower! TX PiOpenCL PiOCL Base PiOpenCL Boost PiOpenCL TX 9Ti PiOCL Base Tesla K4c PiOpenCL Boost Quadro K4 TX Titan R9-9 R9-9X Xeon Phi E-6v3 E-63v i i 66 v v v3 Ph 9X 9 itan 4c T 3 X eon 9- R9 TX T K4 la K X 9 X E R T T o X s dr E E E Te a u Q E-6v E-66 X E. + E. + E E E E June,. order of magnitude... /3

19 Is Python effective? A quick comparison... MPI+PyOpenCL HT MPI+PyOpenCL NoHT MPI+OpenMP/C MPI/C + E. Itops E+. E+. 4 E+. 6 E+. E+. E+. + E 4. ITOPS 64 nodes with cores 3 implementations ( hybrid) June, 9/3

20 Pi Monte Carlo: the match... Python vs C & CPU vs PU vs Phi E+ E+ E+ E+ Tesla for Pros TX for amers AMD/ATI E+ E+ 6E+ 4E+ E+ CPU E+ i i i n i i i 6 9 m m # ds T 6 # ita T T T T T 3 4 ea 6 X 69 X X T X X X X 9 X S o 4 K K4 C M M la K la K K hr la P T V T T T T T a T N adr dr o dr o TX TX T TX TX TX TX T TX sl sla sla es es sla p es u Te Te Te T T Te + T Q Qua Qua K a l s e T June, T T T T T H C c c c T T 9 U 4 ur y ur y 9 # /C /C P 3 I+ 4 4 oh L H PI P P o F 9 F X M E6 MP v3 v4 M nm X- n D D K R an R R R 9 i L N PyC c e h F m m H H e c X C p P N y I+ D yz 9 tiu ti u 4 el 6 6 4n +O R - R IC en en Int tel tel +P P AM D r6 I M PI c M l P el P X x x In x In te MP e M ia M s t t l r A c 4n lu c In In te ve C 4n n 6 x x x In Ka 6 64 ter er er lus t t s s C lu lu C C D /3

21 Small comparison between * PU Low pararegime: the reign of the CPU à PR June, /3

22 Small comparison between *PU MIC Phi Out of Wood, PU De à 4 de PR June, /3

23 Deep Exploration of PR ParaRégime from 4 to 634 Xeon Phi with 6 cœurs 6x 6x4 6x64 6x64 PR from 4 to 634 AMD HD9 with 4 ALU AMD R9-9X with 6 ALU 4x4 6x4 Nvidia TX Titan & Tesla K4 Prime numbers > 4 June, 3/3

24 Very deep exploration of PRs Regime of //ism from 4 to 636 PR from 4 to 636 AMD HD9 & R9-9 Large Period : 4x number of ALU Period of 4 (4 SMX) Nvidia TX Titan & Tesla K4 Small Period : number of SMX units Period of ( SMX) June, 4/3

25 "Back to the Physics" A Newtonian N-body code... Pass on a code "fine grain" Code N-body very simply integrated Newton's second law, autogravitating system Differential integration methods: Euler Implicite, Euler Explicit, Heun, Runge Kutta Settings : Physical: number of particles Numeric: no integration, number of steps Then: influence FP3, FP64 June, /3

26 Very different performances Nvidia, AMD, Intel, Single/Double 3 D²/T_FP3 TX for amers Tesla for Pros D²/T_FP64 AMD/ATI for Alt* CPUs AMD Intel m m Ti Ti Ti Ti K P 6 X6 X9 9 C M la K la K s la la TX T TX T TX TX TX a s l es la Tes Tes Te Tes e T T i X o U x x x x x x x x x x x 9 PU Ph X # Nan P 6 v v v v 3 v 3 v 4 v 4 n 9 K C - o D D D K H H H R9-9 FX en Xe M E6 X z E R y E E E E E E E R A A June, 6/3

27 Performances normalized to cores Nvidia, AMD, Intel, Single / Double Ratio3 Ratio i i i i U hi 6 9 m m K 4 9 9X # ano PU x x x x x x x x x x x T 6 T 9 T T P 9 CP 6 X 4 6 v v v v3 v3 v4 v4 X 6 X 9 P 9 N n K a K C M l a o K TX T TX T TX TX TX HD HD HD R9-9 FX K en Xe la sla sla sla Tes esl M E6 X s z e e T R E y Te Te T T E E E E E E E - R A A June, /3

28 Performances normalized to TDP Nvidia, AMD, Intel, Single / Double Ratio3W Ratio64W 6 4 m 6 K K a C sl a a sl Te sl e e T T 6 9 X X T T TX X HD R9 June, no Na K A U CP 3 6 E /3 x x x x x 3 4 v v v E E E E

29 Performances normalized to Price Nvidia, AMD, Intel, Single/Double 4 Ratio3Pr 4 Ratio64Pr 3 3 i i i i i m m U 4 9 9X # ano PU x x x x x x x x x x x T 6 T 9 T T Ph K P 9 CP 6 X 6 v v v v3 v3 v4 v4 X 6 X 9 N 9 n K K a T T X C M a a sl la o K TX TX TX T TX HD HD HD R9-9 FX K en a l l Xe M E6 X sl esla Tes Tes Te Tes z e R E y T T E E E E E E E - R A A June, 9/3

30 But what's the use of all this? Characterize & avoid regrets... What you must remember : In a standard machine, in 6, the "raw" power is provided by the PU For a low parallelism, the CPU is much more efficient than the PU A PU is greater than the CPU only for PR> The PU achieves its optimum ONLY for some PRs Without a deep characterization, disappointments to foresee... OpenCL is a good compromise like language * PU Python remains the fastest approach of OpenCL What to do: experiment! June, 3/3

Astrosim Les «Houches» the 7th ;-) GPU : the disruptive technology of 21th century. Little comparison between CPU, MIC, GPU.

Astrosim Les «Houches» the 7th ;-) GPU : the disruptive technology of 21th century. Little comparison between CPU, MIC, GPU. Astrosim 217 Les «Houches» the 7th ;-) PU : the disruptive technology of 21th century Little comparison between CPU, MIC, PU Emmanuel Quémener A (human) generation before... A serial B film in 1984 1984

More information

Graphical Processing Units : Can we really entrust the compiler to reach the raw performance? Emmanuel Quemener IR CBP

Graphical Processing Units : Can we really entrust the compiler to reach the raw performance? Emmanuel Quemener IR CBP Graphical Processing Units : Can we really entrust the compiler to reach the raw performance? IR CBP Starting with a tiny joke! How do you call people speak 3 languages? How do you call people speak 2

More information

Starting with a tiny joke!

Starting with a tiny joke! Starting with a tiny joke! How do you call people speak 3 languages? How do you call people speak 2 languages? Trilingual people! Bilingual people! How do you call people speak 1 language? French people!

More information

Houches 2016 : 6th School. Computational Physics. Travels, tries, traces, traps, tricks, trends and trolls...

Houches 2016 : 6th School. Computational Physics. Travels, tries, traces, traps, tricks, trends and trolls... Houches 2016 : 6th School Computational Physics Travels, tries, traces, traps, tricks, trends and trolls... Event log of a native physicist inside the world of parallelism how the heat frontier led disruptive

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

SIDUS Extreme Deduplication & Reproducibility

SIDUS Extreme Deduplication & Reproducibility SIDUS Extreme Deduplication & Reproducibility Single Instance Distributing Universal System A Swiss Knife for Scientific Computing and elsewhere 1/46 Starting with a tiny joke... How do you call people

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Houches 2016 : 6th School. Computational Physics. GPU : the disruptive technology of 21th century. Little comparison between CPU, MIC, GPU

Houches 2016 : 6th School. Computational Physics. GPU : the disruptive technology of 21th century. Little comparison between CPU, MIC, GPU Houches 2016 : 6th School Computational Physics GPU : the disruptive technology of 21th century Little comparison between CPU, MIC, GPU Emmanuel Quémener A (human) generation before... A serial B film

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Starting with a tiny joke!

Starting with a tiny joke! Starting with a tiny joke! How do you call people speak 3 languages? How do you call people speak 2 languages? Trilingual people! Bilingual people! How do you call people speak 1 language? French people!

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty

More information

Generators at the LHC

Generators at the LHC High Performance Computing for Event Generators at the LHC A Multi-Threaded Version of MCFM, J.M. Campbell, R.K. Ellis, W. Giele, 2015. Higgs boson production in association with a jet at NNLO using jettiness

More information

A Comprehensive Study on the Performance of Implicit LS-DYNA

A Comprehensive Study on the Performance of Implicit LS-DYNA 12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four

More information

arxiv: v1 [physics.comp-ph] 4 Nov 2013

arxiv: v1 [physics.comp-ph] 4 Nov 2013 arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Experts in Application Acceleration Synective Labs AB

Experts in Application Acceleration Synective Labs AB Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Double Rewards of Porting Scientific Applications to the Intel MIC Architecture

Double Rewards of Porting Scientific Applications to the Intel MIC Architecture Double Rewards of Porting Scientific Applications to the Intel MIC Architecture Troy A. Porter Hansen Experimental Physics Laboratory and Kavli Institute for Particle Astrophysics and Cosmology Stanford

More information

Optimized Scientific Computing:

Optimized Scientific Computing: Optimized Scientific Computing: Coding Efficiently for Real Computing Architectures Noah Kurinsky SASS Talk, November 11 2015 Introduction Components of a CPU Architecture Design Choices Why Is This Relevant

More information

Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation

Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation Jianting Zhang 1,2 Simin You 2, Le Gruenwald 3 1 Depart of Computer Science, CUNY City College (CCNY) 2 Department of Computer

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Parallel Systems. Project topics

Parallel Systems. Project topics Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a

More information

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation Ray Browell nvidia Technology Theater SC12 1 2012 ANSYS, Inc. nvidia Technology Theater SC12 HPC Revolution Recent

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

Back from School. IN2P Computing School «Heterogeneous Hardware Parallelism» Vincent Lafage

Back from School. IN2P Computing School «Heterogeneous Hardware Parallelism» Vincent Lafage Back from School IN2P3 2016 Computing School «Heterogeneous Hardware Parallelism» Vincent Lafage lafage@ipnoin2p3fr S2I, Institut de Physique Nucléaire d Orsay Université Paris-Sud 9 October 2016 1 / 18

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids

A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids Patrice Castonguay and Antony Jameson Aerospace Computing Lab, Stanford University GTC Asia, Beijing, China December 15 th, 2011

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Accelerated Computing Jun Makino Interactive Research Center of Science Tokyo Institute of Technology

Accelerated Computing Jun Makino Interactive Research Center of Science Tokyo Institute of Technology Accelerated Computing Jun Makino Interactive Research Center of Science Tokyo Institute of Technology IEEE Cluster 2011 Sept 28, 2011 Satoshi s three questions 1. What does accelerated computing solve

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters IEEE CLUSTER 2015 Chicago, IL, USA Luis Sant Ana 1, Daniel Cordeiro 2, Raphael Camargo 1 1 Federal University of ABC,

More information

An Empirical Methodology for Judging the Performance of Parallel Algorithms on Heterogeneous Clusters

An Empirical Methodology for Judging the Performance of Parallel Algorithms on Heterogeneous Clusters An Empirical Methodology for Judging the Performance of Parallel Algorithms on Heterogeneous Clusters Jackson W. Massey, Anton Menshov, and Ali E. Yilmaz Department of Electrical & Computer Engineering

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

Timothy Lanfear, NVIDIA HPC

Timothy Lanfear, NVIDIA HPC GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

Particleworks: Particle-based CAE Software fully ported to GPU

Particleworks: Particle-based CAE Software fully ported to GPU Particleworks: Particle-based CAE Software fully ported to GPU Introduction PrometechVideo_v3.2.3.wmv 3.5 min. Particleworks Why the particle method? Existing methods FEM, FVM, FLIP, Fluid calculation

More information

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015 The Rise of Open Programming Frameworks JC BARATAULT IWOCL May 2015 1,000+ OpenCL projects SourceForge GitHub Google Code BitBucket 2 TUM.3D Virtual Wind Tunnel 10K C++ lines of code, 30 GPU kernels CUDA

More information

Accelerator programming with OpenACC

Accelerator programming with OpenACC ..... Accelerator programming with OpenACC Colaboratorio Nacional de Computación Avanzada Jorge Castro jcastro@cenat.ac.cr 2018. Agenda 1 Introduction 2 OpenACC life cycle 3 Hands on session Profiling

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs

Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Presented at the 2014 ANSYS Regional Conference- Detroit, June 5, 2014 Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Bhushan Desam, Ph.D. NVIDIA Corporation 1 NVIDIA Enterprise

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Big Orange Bramble. August 09, 2016

Big Orange Bramble. August 09, 2016 Big Orange Bramble August 09, 2016 Overview HPL SPH PiBrot Numeric Integration Parallel Pi Monte Carlo FDS DANNA HPL High Performance Linpack is a benchmark for clusters Created here at the University

More information

Exploiting CUDA Dynamic Parallelism for low power ARM based prototypes

Exploiting CUDA Dynamic Parallelism for low power ARM based prototypes www.bsc.es Exploiting CUDA Dynamic Parallelism for low power ARM based prototypes Vishal Mehta Engineer, Barcelona Supercomputing Center vishal.mehta@bsc.es BSC/UPC CUDA Centre of Excellence (CCOE) Training

More information

GPU ARCHITECTURE Chris Schultz, June 2017

GPU ARCHITECTURE Chris Schultz, June 2017 GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA

More information

Approaches to acceleration: GPUs vs Intel MIC. Fabio AFFINITO SCAI department

Approaches to acceleration: GPUs vs Intel MIC. Fabio AFFINITO SCAI department Approaches to acceleration: GPUs vs Intel MIC Fabio AFFINITO SCAI department Single core Multi core Many core GPU Intel MIC 61 cores 512bit-SIMD units from http://www.karlrupp.net/ from http://www.karlrupp.net/

More information

April 2 nd, Bob Burroughs Director, HPC Solution Sales

April 2 nd, Bob Burroughs Director, HPC Solution Sales April 2 nd, 2019 Bob Burroughs Director, HPC Solution Sales Today - Introducing 2 nd Generation Intel Xeon Scalable Processors how Intel Speeds HPC performance Work Time System Peak Efficiency Software

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

General introduction: GPUs and the realm of parallel architectures

General introduction: GPUs and the realm of parallel architectures General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years

More information

Hardware Recommendations for SOLIDWORKS 2017

Hardware Recommendations for SOLIDWORKS 2017 Hardware Recommendations for 2017 Minimum System OS: Windows 10, Windows 8.1 64, or Windows 7 64 CPU: Intel i5 Core Intel i7 Dual Core, or equivalent AMD Hard Drive: >250GB, 7200rpm Graphics Card: 2GB

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include 3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI

More information

Parallel Programming Concepts. Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04. Parallel Background. Why Bother?

Parallel Programming Concepts. Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04. Parallel Background. Why Bother? Parallel Programming Concepts Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04 Parallel Background Why Bother? 1 What is Parallel Programming? Simultaneous use of multiple

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Introducing the Intel Xeon Phi Coprocessor Architecture for Discovery

Introducing the Intel Xeon Phi Coprocessor Architecture for Discovery Introducing the Intel Xeon Phi Coprocessor Architecture for Discovery Imagine The Possibilities Many industries are poised to benefit dramatically from the highly-parallel performance of the Intel Xeon

More information

The Dell Precision T3620 tower as a Smart Client leveraging GPU hardware acceleration

The Dell Precision T3620 tower as a Smart Client leveraging GPU hardware acceleration The Dell Precision T3620 tower as a Smart Client leveraging GPU hardware acceleration Dell IP Video Platform Design and Calibration Lab June 2018 H17415 Reference Architecture Dell EMC Solutions Copyright

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Selecting the right Tesla/GTX GPU from a Drunken Baker's Dozen

Selecting the right Tesla/GTX GPU from a Drunken Baker's Dozen Selecting the right Tesla/GTX GPU from a Drunken Baker's Dozen GPU Computing Applications Here's what Nvidia says its Tesla K20(X) card excels at doing - Seismic processing, CFD, CAE, Financial computing,

More information

Addressing Heterogeneity in Manycore Applications

Addressing Heterogeneity in Manycore Applications Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction

More information

GPUS FOR NGVLA. M Clark, April 2015

GPUS FOR NGVLA. M Clark, April 2015 S FOR NGVLA M Clark, April 2015 GAMING DESIGN ENTERPRISE VIRTUALIZATION HPC & CLOUD SERVICE PROVIDERS AUTONOMOUS MACHINES PC DATA CENTER MOBILE The World Leader in Visual Computing 2 What is a? Tesla K40

More information

GPU ARCHITECTURE Chris Schultz, June 2017

GPU ARCHITECTURE Chris Schultz, June 2017 Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE Problems Solved Over Time versus Why are they different? Complex

More information

High Performance Computing The Essential Tool for a Knowledge Economy

High Performance Computing The Essential Tool for a Knowledge Economy High Performance Computing The Essential Tool for a Knowledge Economy Rajeeb Hazra Vice President & General Manager Technical Computing Group Datacenter & Connected Systems Group July 22 nd 2013 1 What

More information

MCMini: Monte Carlo on GPGPU

MCMini: Monte Carlo on GPGPU MCMini: Monte Carlo on GPGPU Ryan Marcus rmarcus@lanl.gov Summer 2012 Abstract MCMini is a proof of concept that demonstrates the possibility for Monte Carlo neutron transport using OpenCL with a focus

More information

Acceleration of Virtual Machine Live Migration on QEMU/KVM by Reusing VM Memory

Acceleration of Virtual Machine Live Migration on QEMU/KVM by Reusing VM Memory Acceleration of Virtual Machine Live Migration on QEMU/KVM by Reusing VM Memory Soramichi Akiyama Department of Creative Informatics Graduate School of Information Science and Technology The University

More information

Application Performance on Dual Processor Cluster Nodes

Application Performance on Dual Processor Cluster Nodes Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys

More information

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia

More information

Partitioning and Divide-and-Conquer Strategies

Partitioning and Divide-and-Conquer Strategies Partitioning and Divide-and-Conquer Strategies Chapter 4 slides4-1 Partitioning Partitioning simply divides the problem into parts. Divide and Conquer Characterized by dividing problem into subproblems

More information

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017 VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all

More information

ANSYS HPC. Technology Leadership. Barbara Hutchings ANSYS, Inc. September 20, 2011

ANSYS HPC. Technology Leadership. Barbara Hutchings ANSYS, Inc. September 20, 2011 ANSYS HPC Technology Leadership Barbara Hutchings barbara.hutchings@ansys.com 1 ANSYS, Inc. September 20, Why ANSYS Users Need HPC Insight you can t get any other way HPC enables high-fidelity Include

More information

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories HPC Benchmarking Presentations: Jack Dongarra, University of Tennessee & ORNL The HPL Benchmark: Past, Present & Future Mike Heroux, Sandia National Laboratories The HPCG Benchmark: Challenges It Presents

More information

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow.

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Big problems and Very Big problems in Science How do we live Protein

More information

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

Bring your application to a new era:

Bring your application to a new era: Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.

More information

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging Saoni Mukherjee, Nicholas Moore, James Brock and Miriam Leeser September 12, 2012 1 Outline Introduction to CT Scan, 3D reconstruction

More information

Introduction to Xeon Phi. Bill Barth January 11, 2013

Introduction to Xeon Phi. Bill Barth January 11, 2013 Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider

More information

Parallel Programming on Ranger and Stampede

Parallel Programming on Ranger and Stampede Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Computing and energy performance

Computing and energy performance Equipe I M S Equipe Projet INRIA AlGorille Computing and energy performance optimization i i of a multi algorithms li l i PDE solver on CPU and GPU clusters Stéphane Vialle, Sylvain Contassot Vivier, Thomas

More information

Avigilon Control Center System Integration Guide

Avigilon Control Center System Integration Guide Avigilon Control Center System Integration Guide with Software House C Cure 9000 INT-CCURE-C-Rev1 Copyright 2013 Avigilon. All rights reserved. No copying, distribution, publication, modification, or incorporation

More information

High performance Computing and O&G Challenges

High performance Computing and O&G Challenges High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating

More information

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles

More information

NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas

NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -

More information

CSC573: TSHA Introduction to Accelerators

CSC573: TSHA Introduction to Accelerators CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information