From Modelling House to Test Center. Emmanuel Quémener
Transcription
1 Presentation of the IT resources provided by the Blaise Pascal Center: from modelling house to test center. Emmanuel Quémener
2 Blaise Pascal Center: «House of Modelling» &... Conferences, trainings, projects hotel. June, /3
3 Centre Blaise Pascal ~ Dryden FR. A small illustrative example: NASA X-9. Airframe from F-, engine from F-, gears from F-6. Studies: wing flaps, incidence >, «Fly-By-Wire». Recycle, reuse, and explore new domains.
4 From engines to the parallel envelope: a necessary exploration? Flight envelope: speed/altitude, load factor. Parallel envelope (software, hardware): parallelization/memory/CPU/GPU, input/output/contexts/...
5 CBP: also a test center. An experimental platform with technical benches. Multi-node: clusters from 4 to 64 nodes (nodes/cores: 4/4, /64, /64, 64/). Multi-core: more than machines from to cores; nodes/servers: from to cores; workstations: from to 6 cores; different types of CPU. GPUs & accelerators: 46 different GPU models (AMD & Nvidia), Intel MIC; GPGPU: 9; Nvidia GPUs: ; AMD/ATI GPUs: types; Xeon Phi P. Integration: 4 virtual machines, Debian from Lenny to Sid, both 32 & 64 bits,... Exotic hardware: 3 ARMv machines under Debian Jessie or Ubuntu. 3D visualization: stations, video projectors, monitors, 4 pairs of 3D glasses. Virtual desktop: up to 3 hosts with x2go/VirtualGL. COMOD with SIDUS: «Compute On My Own Device» for laboratory users; SIDUS is the «Single Instance Distributing Universal System». Galaxy prototype for intensive processing of biological data.
6 Some «digital benches», permanent or ephemeral! Schools in Les Houches on «computational physics»: a complete «standalone» virtual machine and a SIDUS virtual machine. Digital Humanities. Geography: virtual machines and already operational ones; visualization workstations for graphical processing. Biology: Galaxy gateway (with processing on cluster nodes); processing and display stations for MorphoGraphX; server provisioning to study & process under Repeat* with GlusterFS, TMPFS & BRD.
7 What is it for? Example: Repeat* on GlusterFS/tmpfs. [Figure: progression of «Input Database Coverage» and of «Family Refinement» for RamDisk and GlusterFS runs, annotated «Less is Better!» and «More is Better!», with the acceleration in %.]
8 An example of interactions between Laboratory/CBP/PSMN. [Diagram labels: Laboratory; Test center; Compute Center; Evaluation; Preproduction; Production; AMD; Nvidia; Know-How; Systems & Networks; SIDUS; Hardware CBP / Sidus OS CBP; Hardware LBMC / Sidus OS CBP; Hardware PSMN / Sidus OS PSMN.] The test center influences the choice of hardware.
9 How to manage + machines? SIDUS! Single Instance Distributing Universal System.
10 Examples of studies: predictions... Going beyond Amdahl's law with the Mylq one: from T=T+p/N to T=T+c*N+p/N. [Figure: PiMC, parallel execution with C/MPI for distributed iterations. Itops (Iterative Operations Per Second) versus Parallel Rate (NP in MPI): measurements, Amdahl law, and Mylq law fits over several ranges of NP.]
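The two timing models on this slide can be compared numerically. The sketch below is a minimal illustration, assuming the constant term is a fixed serial time and `c` a per-process overhead; the names `serial`, `parallel`, and `cost` are illustrative and do not come from the slides.

```python
import math


def amdahl_time(n, serial, parallel):
    """Amdahl-style model: fixed serial time plus parallel work shared by n processes."""
    return serial + parallel / n


def overhead_time(n, serial, parallel, cost):
    """Same model with a linear overhead term cost * n (the c*N term on the slide)."""
    return serial + cost * n + parallel / n


def best_parallel_rate(parallel, cost):
    """Continuous minimiser of cost*n + parallel/n, obtained by setting the derivative to zero."""
    return math.sqrt(parallel / cost)


if __name__ == "__main__":
    # Hypothetical values, chosen only to make the trend visible.
    for n in (1, 2, 4, 8, 16):
        print(n, amdahl_time(n, 1.0, 8.0), overhead_time(n, 1.0, 8.0, 0.5))
    print("optimum near n =", best_parallel_rate(8.0, 0.5))
```

With the overhead term, elapsed time stops decreasing past the optimum and then grows, which is the behaviour a pure Amdahl fit cannot reproduce.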
11 From (multi, many)-cores to myri-ALUs? A parallel laboratory on a chip! You get from to tens of thousands of equivalent PU! You can choose between CPU, MIC, and GPU. How to choose between them and tell them apart for a given use?
12 A long, long time ago, between 9... and , the variable was frequency! Power vs RPM: x in years, then a stop (and even a drop).
13 The "engine", source of "power". Forget frequency: change RPM to PR. [Figure: "horsepower" vs RPM for Pthreads, OpenMP, MPI, OpenCL Intel, OpenCL AMD, and theory.] How do these engines "behave" under load? The "engine" is a system entangling hardware, OS, and software.
14 Why is the GPU so powerful? Shading & matrix calculation. Model to World (MW): 3 matrix products (rotation, translation, scaling). World to View (WV): matrix products positioning the camera (location, direction). View to Projection (VP), then pixelisation. So a GPU is a "big" matrix multiplier.
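The Model to World stage described on this slide can be sketched with plain 4x4 homogeneous matrices. This is only an illustration in pure Python: the helper names and the order scaling, then rotation, then translation are assumptions, not taken from the slides.

```python
import math


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def scaling(s):
    """Uniform scaling by factor s."""
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]


def rotation_z(theta):
    """Rotation by theta radians around the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def translation(tx, ty, tz):
    """Translation by the vector (tx, ty, tz)."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def transform(m, point):
    """Apply the 4x4 matrix m to a 3D point using homogeneous coordinates."""
    v = [point[0], point[1], point[2], 1.0]
    out = [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]
    return out[:3]


# Model to World: scale first, then rotate, then translate (matrices compose right to left).
model_to_world = matmul(translation(1, 0, 0),
                        matmul(rotation_z(math.pi / 2), scaling(2)))

if __name__ == "__main__":
    print(transform(model_to_world, (1.0, 0.0, 0.0)))
```

Every vertex of a scene goes through the same small products, which is why hardware built for rendering is also a natural matrix multiplier.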
15 BLAS comparison: CPU vs GPU. xGEMM with x = D (DP) or S (SP). [Figure: xGEMM performance in DP and SP with OpenBLAS, clBLAS, Thunking, and CuBLAS, per device, grouped as Tesla for Pros, GTX for Gamers, AMD/ATI for Alt*, and CPUs (AMD, Intel).]
16 What Nvidia never advertises... xGEMM over a large domain. [Figure: performance vs size for Tesla P and GTX.] Yes, the performance is there, but only under specific conditions!
17 Which simple code to implement in parallel? PiMC: Pi by the "target" (dartboard) method, a historical example of the Monte Carlo method. Parallel implementation: equal distribution of the iterations. Up to 4 parameters: number of iterations, Parallel Rate (PR), and optionally the type of variable (INT32, INT64, FP32, FP64) and the RNG (MWC, CONG, SHR3, KISS). Simple observables: the estimate of Pi (just indicative, Pi not being rational :-)) and the elapsed time (individual or global).
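A minimal Python sketch of the PiMC idea follows, assuming the iterations are split as evenly as possible over PR independent workers, each with its own seeded generator. It uses Python's standard `random` module rather than the MWC/CONG/SHR3/KISS generators named on the slide, and a thread pool only to show that the chunks are independent (threads give no real speed-up in CPython).

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_iterations(total, parallel_rate):
    """Distribute `total` iterations as evenly as possible over `parallel_rate` workers."""
    base, extra = divmod(total, parallel_rate)
    return [base + (1 if rank < extra else 0) for rank in range(parallel_rate)]


def count_inside(iterations, seed):
    """Count random points of the unit square that fall inside the quarter disc."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


def estimate_pi(total, parallel_rate, seed=0):
    """Estimate Pi from `total` draws spread over `parallel_rate` independent workers."""
    chunks = split_iterations(total, parallel_rate)
    seeds = [seed + rank for rank in range(parallel_rate)]
    with ThreadPoolExecutor(max_workers=parallel_rate) as pool:
        inside = sum(pool.map(count_inside, chunks, seeds))
    return 4.0 * inside / total


if __name__ == "__main__":
    print(estimate_pi(1_000_000, 4))
```

The two observables of the slide map directly onto this sketch: the returned estimate, and the elapsed time one would measure around `estimate_pi` (global) or around each `count_inside` call (individual).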
18 With QPU, low "regime"... The CPU crushes the GPU! From to x slower! [Figure: PiOpenCL results, base and boost, for GTX, Tesla, Quadro, and Titan GPUs, AMD R9 GPUs, Xeon Phi, and Xeon E- CPUs; orders of magnitude apart.]
19 Is Python effective? A quick comparison... [Figure: Itops for MPI+PyOpenCL (HT), MPI+PyOpenCL (NoHT), MPI+OpenMP/C, and MPI/C.] 64 nodes with cores, 3 implementations ( hybrid).
20 Pi Monte Carlo: the match... Python vs C & CPU vs GPU vs Phi. [Figure: results per device and implementation, grouped as Tesla for Pros, GTX for Gamers, AMD/ATI, and CPU, plus cluster runs.]
21 A small comparison between *PU. Low parallel regime: the reign of the CPU, to PR.
22 A small comparison between *PU. The MIC Phi comes out of the woods, GPU. PR from to 4.
23 Deep exploration of PR. Parallel regime from 4 to 634. [Figure panels: Xeon Phi with 6 cores (6x, 6x4, 6x64, 6x64); AMD HD9 with 4 ALU; AMD R9-9X with 6 ALU (4x4, 6x4); Nvidia GTX Titan & Tesla K4 (prime numbers > 4).]
24 Very deep exploration of PRs. Parallel regime from 4 to 636. AMD HD9 & R9-9: large period, 4x the number of ALU. Nvidia GTX Titan & Tesla K4: small period, the number of SMX units (period of 4 for 4 SMX, period of for SMX).
25 "Back to the Physics": a Newtonian N-body code. Moving on to a "fine-grained" code. A very simple N-body code: integration of Newton's second law for a self-gravitating system. Differential integration methods: implicit Euler, explicit Euler, Heun, Runge-Kutta. Settings: physical (number of particles) and numerical (integration step, number of steps). Then: the influence of FP32 vs FP64.
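As an illustration of the simplest of the integration methods listed, here is a sketch of one explicit Euler step for a self-gravitating system in pure Python. The gravitational constant `g=1.0` and the `softening` length (used to avoid a division by zero at very short distances) are assumptions made for this example, not settings taken from the slides.

```python
import math


def accelerations(positions, masses, g=1.0, softening=1e-3):
    """Pairwise Newtonian accelerations, O(N^2), with a softened distance."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + softening ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += g * masses[j] * d[k] * inv_r3
    return acc


def euler_step(positions, velocities, masses, dt, g=1.0, softening=1e-3):
    """Explicit Euler: advance positions with the old velocities, velocities with the old accelerations."""
    acc = accelerations(positions, masses, g, softening)
    new_positions = [[p[k] + dt * v[k] for k in range(3)]
                     for p, v in zip(positions, velocities)]
    new_velocities = [[v[k] + dt * a[k] for k in range(3)]
                      for v, a in zip(velocities, acc)]
    return new_positions, new_velocities


if __name__ == "__main__":
    pos = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    pos, vel = euler_step(pos, vel, [1.0, 1.0], dt=0.01)
    print(pos, vel)
```

The double loop over particle pairs is the "fine-grained" part: each particle's acceleration can be computed independently, which is what makes the code a natural candidate for a GPU.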
26 Very different performances: Nvidia, AMD, Intel, single/double precision. [Figure: D²/T in FP32 and in FP64 per device, grouped as GTX for Gamers, Tesla for Pros, AMD/ATI for Alt*, and CPUs (AMD, Intel).]
27 Performances normalized to cores: Nvidia, AMD, Intel, single/double precision. [Figure: Ratio32 and Ratio64 per device.]
28 Performances normalized to TDP: Nvidia, AMD, Intel, single/double precision. [Figure: Ratio32W and Ratio64W per device.]
29 Performances normalized to price: Nvidia, AMD, Intel, single/double precision. [Figure: Ratio32Pr and Ratio64Pr per device.]
30 But what is the use of all this? Characterize & avoid regrets... What you must remember: In a standard machine, in 6, the "raw" power is provided by the GPU. For low parallelism, the CPU is much more efficient than the GPU. A GPU beats the CPU only for PR>. The GPU reaches its optimum ONLY for some PRs. Without a deep characterization, expect disappointments... OpenCL is a good compromise as a language for *PU. Python remains the fastest approach to OpenCL. What to do: experiment!
HPC Benchmarking Presentations: Jack Dongarra, University of Tennessee & ORNL The HPL Benchmark: Past, Present & Future Mike Heroux, Sandia National Laboratories The HPCG Benchmark: Challenges It Presents
More informationLet s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow.
Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Big problems and Very Big problems in Science How do we live Protein
More informationJ. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst
Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function
More informationThe Future of High Performance Computing
The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer
More informationBring your application to a new era:
Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.
More informationCUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging
CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging Saoni Mukherjee, Nicholas Moore, James Brock and Miriam Leeser September 12, 2012 1 Outline Introduction to CT Scan, 3D reconstruction
More informationIntroduction to Xeon Phi. Bill Barth January 11, 2013
Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider
More informationParallel Programming on Ranger and Stampede
Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationComputing and energy performance
Equipe I M S Equipe Projet INRIA AlGorille Computing and energy performance optimization i i of a multi algorithms li l i PDE solver on CPU and GPU clusters Stéphane Vialle, Sylvain Contassot Vivier, Thomas
More informationAvigilon Control Center System Integration Guide
Avigilon Control Center System Integration Guide with Software House C Cure 9000 INT-CCURE-C-Rev1 Copyright 2013 Avigilon. All rights reserved. No copying, distribution, publication, modification, or incorporation
More informationHigh performance Computing and O&G Challenges
High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating
More informationGPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA
GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles
More informationNVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas
NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -
More informationCSC573: TSHA Introduction to Accelerators
CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More information