Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.
|
|
- Gloria Preston
- 6 years ago
- Views:
Transcription
1 Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract No. DE-AC05-00OR ORNL is managed by UT-Battelle for the US Department of Energy
2 Application-Hardware Interface Direct Interface Uni-Indirect Interface Bi-Indirect Interface HPC Application HPC Application HPC Application Interface Width Interface Width Interface Width Hardware/HPC-system Runtime + Libraries Domain-Specific Runtime + Libraries Interface Width ~ O(1): Manageable Performance Portability, regardless of the programming model (only O(1) parts of the code interact with raw hardware) Interface Width Interface Width Hardware/HPC-system Hardware/HPC-system Performance Portability effort is shifted towards Runtime and Libraries Performance Portability effort is shifted towards domain-specific Runtime and Libraries Interface Width ~ O(N): Performance Portability can become an issue, even if using OpenMP/OpenACC HPC Application responsibility to be aware of Hardware 2 ExaTensor Performance portability effort is shifted towards Runtime and Libraries HPC Application is aware of Runtime, Runtime is responsible to be aware of Hardware DS Runtime is aware of both the Application and Hardware
3 Domain-Specific Virtual Processor HPC Application Understands the specificity of domain-specific data, operations, and algorithms Interface Width Domain-Specific Virtual Processor + Libraries Understands the specificity of a class of HPC architectures via a parameterized template of node architectures A better structured runtime that formally resembles a processor but specialized to the domainspecific workloads Interface Width Hardware/HPC-system The domain HPC applications express their algorithms in a domain-native language (higher-level), either via a standalone or embedded DSL; The domain algorithms are compiled into instructions for a domain-aware virtual architecture; The virtual architecture is built from virtual units that map reasonably well to the physical hardware; Hardware encapsulation, yet potential for a seemless co-design. 3 ExaTensor
4 Domain-Specific Virtual Processor Parameterized domain-specific virtual architecture template DS-RDMA DS Hierarchical Data Cache DS Instruction Queue DS Metadata Cache DS Checkpoint 4 ExaTensor Device 3 Instruction Queue Device 2 Instruction Queue Device 1 Instruction Queue Device 0 Instruction Queue Device 3 Device 2 Device 1 Device 0 Requires: Building blocks for heterogeneous computing: HiHat? DS Instruction Scheduler
5 Performance Portability Strategy Abstract computing system template: Distributed (weakly-coupled) level: The computing system is composed of compute nodes interconnected via network interfaces in some topology (Semi-)shared (strongly-coupled) level: Each node is composed of multiple compute devices of the same or different kinds, possibly sharing the same (hierarchical) memory Algorithms are formulated for this abstract computer The hardware specificity is masked by driver libraries that provide a device-unified API interface for a set of necessary domain-specific primitives (DS-ISA) New hardware = New driver library 5 ExaTensor
6 Use Case: Numerical Tensor Algebra Quantum many-body theory: Electronic and Nuclear structure, Quantum computing: Wave-function Loop quantum gravity: Fine structure of space-time Multivariate data analytics: Extracting correlations, features, data compression Classical physics Any data arranged as a multi-dimensional array is a tensor (in the most general sense) Any function of multiple variables discretized over a grid or some functional inner-product space is a tensor 6 ExaTensor
7 Coupled-Cluster Theory j i a j i a 7 ExaTensor b i i b k Electron correlation = Correlation between hole-particle excitations j c ab j ab
8 Automated Development Tools and DSL Automated equation/code generators are unavoidable DSLangs and DSLibs can provide a scalable development environment: DiaGen input/output: 8 ExaTensor Efficient flexible runtime?
9 Past/Present Experience : CLUSTER: Automated code generation: CAS-CCSD method = Millions of generated SLOC : CLUSTER moved to a direct interpretation of many-body CC equations: High-level spec Bytecode : ACES III and ACES IV: Domain-specific superinstruction language and runtime (SIAL/SIP): SIAL (med.level) SIP bytecode Interpretation by SIP 2014+: ExaTENSOR framework: Direct interpretation of generic tensor expressions on heterogeneous HPC architectures with cross-domain applications Other efforts: Code generation (TCE), embedded DSL + Tensor Algebra library (Aquarius/Cyclops, TiledArray) 9 ExaTensor
10 Math Framework: Basic Tensor Algebra pq... rs... Formal tensor: T T ( p, q, r, s,...) : n-d Array Full tensor: T ( p, q, r, s ) : p P, q Q, r R, s S Tensor slice: T ( p, q, r, s ) : p P ' P, q Q' Q, r R ' R, s S ' S Few primitive operations: Tensor addition: pq rs p, q, r, s : T Tensor product: pq rs p, q, r, s : T Tensor contraction: pq rs p, q, r, s : T 10 ExaTensor pq rs pq rs L R Parallelism! p r q s L R Parallelism! pai bcd qbcd rsai L R Compute intensive (potentially)!
11 Math Framework: Tensor Decompositions Graphical (diagrammatic) representation: Matrix: Matrix*Matrix: Linear algebra: SVD is optimal in the 2-norm: Tensor (multi-linear) algebra: Many choices: Canonical polyadic 11 ExaTensor Tucker Tensor tree Tensor train
12 Tensor Sparsity Dense tensors: Block-sparse tensors (container of dense slices): Regular sparse tensors: Some regular sparsity pattern other than block sparsity Irregular sparse tensors: Sparsity pattern has no regularity 12 ExaTensor
13 Adaptive (+Hierarchical) Tensor Algebra Extrapolation of H/H2-matrix algebra to TENSORS 13 ExaTensor Large tensor elements can become tensors themselves (higher resolution); Weak tensor slices can be compressed by lowering the resolution, up to a single (complex) number; Adapt to the calculated electronic state and available HPC resources; Should be better than just black-and-white discarding. D.I.L. Int. J. Quantum Chem. 2014; D.I.L. ArXiv 2017
14 Computational Challenges Dense tensor algebra: Communication bandwidth: Communication avoiding and regularization (similar to the matrix multiplication, e.g., CARMA by Demmel et al.) Memory size: Out-of-core algorithms Block-sparse tensor algebra adds: Irregular data placement and computational workload: Load balancing Task-based programming model Adaptive hierarchical block-sparse tensor algebra adds: Dynamic irregular data placement and computational workload: Load balancing, dynamic data layout Adaptive task-based programming model ExaTENSOR: 1. Implementation of the tensor algebra DSVP for heterogeneous computing; 2. Implementation of scalable tensor algebra algorithms for TA-DSVP. 14 ExaTensor
15 Domain-Specific Virtual Processing Domain-specific algorithm Domain-specific virtual processor(s) Physical processor(s): Domain-Specific Virtual Processor (abstract class) Domain-Specific Instruction (abstract class) Domain-Specific Operand (abstract class) Tensor Algebra Processor Tensor Instruction Tensor Operand Logic Processor (control, meta-data) Numeric Processor (numeric tensor ops) Why this way? 15 ExaTensor Less unneeded detail, more focus on the domain-specific algorithms (TA); Better portability; Better debugability/profileability of domain-specific algorithms; Better opportunities for co-design. Heterogeneous DSVP (ExaTENSOR)
16 Node-Level Virtualization: Hiding Hardware Hardware Processors (CPU, GPU, etc.) 16 ExaTensor
17 Node-Level Virtualization: Hiding Hardware Domain-Specific Virtual Processor (DSVP) ExaTENSOR DSVP Domain-Specific Instructions Data 17 ExaTensor Data
18 Node-Level Virtualization: Hiding Hardware Domain-Specific Virtual Processor (DSVP) ExaTENSOR DSVP Domain-Specific Instructions Driver Libraries (Tensor Algebra) Data 18 ExaTensor Data
19 Global Virtualization: Hiding HPC Scale 19 ExaTensor
20 Global Virtualization: Hiding HPC Scale 20 ExaTensor
21 Global Virtualization: Hiding HPC Scale 21 ExaTensor
22 Global Virtualization: Hiding HPC Scale 22 ExaTensor
23 Global Virtualization: Hiding HPC Scale 23 ExaTensor
24 Recursive Dynamic Task Scheduling Data storage granularity is decoupled from the task granularity O(1) 24 ExaTensor O(1) O(1)
25 TAL-SH: On-Node Tensor Algebra Layer Basic tensor algebra operations: Tensor contraction, tensor product, tensor addition, tensor norm, etc. Supports multicore CPU and (multiple) NVIDIA GPU Asynchronous task-based execution on accelerators Automatic memory and data transfer management, data transfer pipelining User-controlled data presence management (images) Tensor contractions: Up to 1.1 TFlop/s double precision on NVIDIA Kepler K20x for arithmetic intensities > 1000 Up to 4.2 TFlop/s double precision on NVIDIA Pascal P100 for arithmetic intensities > ExaTensor
26 TAL-SH: On-Node Tensor Algebra Layer 26 ExaTensor
27 cutt: Fast Tensor Transpose for GPU 27 ExaTensor A.P.Hynninen, D.I.L. ArXiv 2017
28 Conclusions Domain-Specific Virtual Processor pros: Domain algorithm friendly (higher-level): Application-Runtime codesign, customizable virtual architecture; Hardware agnostic by design: Only requires new driver libraries with domain-specific primitives; HPC scale agnostic by design (via a recursive HPC system view); Better fault-tolerance (autonomous virtual nodes can be turned on/off); Good model for Hardware-Runtime codesign; Better debugging (in terms of higher-level operations); Better profiling (in terms of higher-level operations). Domain-Specific Virtual Processor cons: Level of effort; Runtime overhead (like any virtual machine). 28 ExaTensor
29 Potential Programming Models Direct non-task static: Algorithm Target code (MPI+X or PGAS+X) Hardware Direct task static: Algorithm Parameterized static DAG Task-based runtime Hardware Direct task dynamic: Algorithm Dynamic DAG Task-based runtime Hardware Indirect non-task static: Algorithm DSL Domain-specific bytecode Runtime interpretation Hardware Indirect task static: Algorithm DSL Parameterized static DAG Task-based runtime Hardware Indirect task dynamic: Algorithm DSL Dynamic DAG Task-based runtime Hardware 29 ExaTensor
Preparing GPU-Accelerated Applications for the Summit Supercomputer
Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership
More informationAlgorithm and Library Software Design Challenges for Tera, Peta, and Future Exascale Computing
Algorithm and Library Software Design Challenges for Tera, Peta, and Future Exascale Computing Bo Kågström Department of Computing Science and High Performance Computing Center North (HPC2N) Umeå University,
More informationSampling Using GPU Accelerated Sparse Hierarchical Models
Sampling Using GPU Accelerated Sparse Hierarchical Models Miroslav Stoyanov Oak Ridge National Laboratory supported by Exascale Computing Project (ECP) exascaleproject.org April 9, 28 Miroslav Stoyanov
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationPhilip C. Roth. Computer Science and Mathematics Division Oak Ridge National Laboratory
Philip C. Roth Computer Science and Mathematics Division Oak Ridge National Laboratory A Tree-Based Overlay Network (TBON) like MRNet provides scalable infrastructure for tools and applications MRNet's
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationQuery Compilation of Dataflow Programs for Heterogeneous Platforms. Felix Beier & Kai-Uwe Sattler Ilmenau
Programs for Heterogeneous Platforms Felix Beier & Kai-Uwe Sattler DBIS@TU Ilmenau www.tu-ilmenau.de/dbis Outline 1. Requirements of Engineering Applications 2. Dataflow Programming in PipeFlow 3. Query
More informationIOS: A Middleware for Decentralized Distributed Computing
IOS: A Middleware for Decentralized Distributed Computing Boleslaw Szymanski Kaoutar El Maghraoui, Carlos Varela Department of Computer Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/wwc
More informationTuring Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA
Turing Architecture and CUDA 10 New Features Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture New SM Architecture Multi-Precision Tensor Core RT Core Turing MPS Inference Accelerated,
More informationContents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11
Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationIntroduction to parallel Computing
Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts
More informationWrite a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical
Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or
More informationFCP: A Fast and Scalable Data Copy Tool for High Performance Parallel File Systems
FCP: A Fast and Scalable Data Copy Tool for High Performance Parallel File Systems Feiyi Wang (Ph.D.) Veronica Vergara Larrea Dustin Leverman Sarp Oral ORNL is managed by UT-Battelle for the US Department
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationIntroduction to HPC Parallel I/O
Introduction to HPC Parallel I/O Feiyi Wang (Ph.D.) and Sarp Oral (Ph.D.) Technology Integration Group Oak Ridge Leadership Computing ORNL is managed by UT-Battelle for the US Department of Energy Outline
More informationUltra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc.
Abstract Ultra Large-Scale FFT Processing on Graphics Processor Arrays Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc. Graphics Processor Unit (GPU) technology has been shown well-suited to efficient
More informationSuper Instruction Architecture for Heterogeneous Systems. Victor Lotric, Nakul Jindal, Erik Deumens, Rod Bartlett, Beverly Sanders
Super Instruction Architecture for Heterogeneous Systems Victor Lotric, Nakul Jindal, Erik Deumens, Rod Bartlett, Beverly Sanders Super Instruc,on Architecture Mo,vated by Computa,onal Chemistry Coupled
More informationOverview of research activities Toward portability of performance
Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationPortability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures
Photos placed in horizontal position with even amount of white space between photos and header Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Christopher Forster,
More informationFast Multipole and Related Algorithms
Fast Multipole and Related Algorithms Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani Joint work with Nail A. Gumerov Efficiency by exploiting symmetry and A general
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationEd D Azevedo Oak Ridge National Laboratory Piotr Luszczek University of Tennessee
A Framework for Check-Pointed Fault-Tolerant Out-of-Core Linear Algebra Ed D Azevedo (e6d@ornl.gov) Oak Ridge National Laboratory Piotr Luszczek (luszczek@cs.utk.edu) University of Tennessee Acknowledgement
More informationAmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015
AmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015 Agenda Introduction to AmgX Current Capabilities Scaling V2.0 Roadmap for the future 2 AmgX Fast, scalable linear solvers, emphasis on iterative
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationdesigning a GPU Computing Solution
designing a GPU Computing Solution Patrick Van Reeth EMEA HPC Competency Center - GPU Computing Solutions Saturday, May the 29th, 2010 1 2010 Hewlett-Packard Development Company, L.P. The information contained
More informationEarly Experiences Writing Performance Portable OpenMP 4 Codes
Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic
More informationHow to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO
How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO Foreword How to write code that will survive the many-core revolution? is being setup as a collective
More informationParallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
More informationDARPA-BAA Hierarchical Identify Verify Exploit (HIVE) Frequently Asked Questions (FAQ) August 18, 2016
DARPA-BAA-16-52 Hierarchical Identify Verify Exploit (HIVE) Frequently Asked Questions (FAQ) August 18, 2016 DARPA-BAA-16-52 Hierarchical Identify Verify Exploit (HIVE) Frequently Asked Questions (FAQ)
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationAMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016
AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING BILL.BRANTLEY@AMD.COM, FELLOW 3 OCTOBER 2016 AMD S VISION FOR EXASCALE COMPUTING EMBRACING HETEROGENEITY CHAMPIONING OPEN SOLUTIONS ENABLING LEADERSHIP
More informationPower dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem.
The VLSI Interconnect Challenge Avinoam Kolodny Electrical Engineering Department Technion Israel Institute of Technology VLSI Challenges System complexity Performance Tolerance to digital noise and faults
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationHarp-DAAL for High Performance Big Data Computing
Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationData Management in Parallel Scripting
Data Management in Parallel Scripting Zhao Zhang 11/11/2012 Problem Statement Definition: MTC applications are those applications in which existing sequential or parallel programs are linked by files output
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationTitan - Early Experience with the Titan System at Oak Ridge National Laboratory
Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid
More informationPerformance Measurement and Evaluation Tool for Large-scale Systems
Performance Measurement and Evaluation Tool for Large-scale Systems Hong Ong ORNL hongong@ornl.gov December 7 th, 2005 Acknowledgements This work is sponsored in parts by: The High performance Computing
More informationOverview of Tianhe-2
Overview of Tianhe-2 (MilkyWay-2) Supercomputer Yutong Lu School of Computer Science, National University of Defense Technology; State Key Laboratory of High Performance Computing, China ytlu@nudt.edu.cn
More informationAdvanced CUDA Optimization 1. Introduction
Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines
More informationNetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013
NetSpeed ORION: A New Approach to Design On-chip Interconnects August 26 th, 2013 INTERCONNECTS BECOMING INCREASINGLY IMPORTANT Growing number of IP cores Average SoCs today have 100+ IPs Mixing and matching
More informationEfficient Multi-GPU CUDA Linear Solvers for OpenFOAM
Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,
More informationDataSToRM: Data Science and Technology Research Environment
The Future of Advanced (Secure) Computing DataSToRM: Data Science and Technology Research Environment This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering
More informationHandout 3. HSAIL and A SIMT GPU Simulator
Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants
More informationCUDA 5 and Beyond. Mark Ebersole. Original Slides: Mark Harris 2012 NVIDIA
CUDA 5 and Beyond Mark Ebersole Original Slides: Mark Harris The Soul of CUDA The Platform for High Performance Parallel Computing Accessible High Performance Enable Computing Ecosystem Introducing CUDA
More informationcustinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader
custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader What we will see today The first dynamic graph data structure for the GPU. Scalable in size Supports the same functionality
More informationStatistical Models for Automatic Performance Tuning
Statistical Models for Automatic Performance Tuning Richard Vuduc, James Demmel (U.C. Berkeley, EECS) {richie,demmel}@cs.berkeley.edu Jeff Bilmes (Univ. of Washington, EE) bilmes@ee.washington.edu May
More informationDynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California
Dynamic Cuda with F# HPC GPU & F# Meetup March 19 San Jose, California Dr. Daniel Egloff daniel.egloff@quantalea.net +41 44 520 01 17 +41 79 430 03 61 About Us! Software development and consulting company!
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationThe Bifrost GPU architecture and the ARM Mali-G71 GPU
The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our
More informationScalable, Automated Characterization of Parallel Application Communication Behavior
Scalable, Automated Characterization of Parallel Application Communication Behavior Philip C. Roth Computer Science and Mathematics Division Oak Ridge National Laboratory 12 th Scalable Tools Workshop
More informationPaving the Road to Exascale
Paving the Road to Exascale Gilad Shainer August 2015, MVAPICH User Group (MUG) Meeting The Ever Growing Demand for Performance Performance Terascale Petascale Exascale 1 st Roadrunner 2000 2005 2010 2015
More informationExperiences with CUDA & OpenACC from porting ACME to GPUs
Experiences with CUDA & OpenACC from porting ACME to GPUs Matthew Norman Irina Demeshko Jeffrey Larkin Aaron Vose Mark Taylor ORNL is managed by UT-Battelle for the US Department of Energy ORNL Sandia
More informationCharm++ for Productivity and Performance
Charm++ for Productivity and Performance A Submission to the 2011 HPC Class II Challenge Laxmikant V. Kale Anshu Arya Abhinav Bhatele Abhishek Gupta Nikhil Jain Pritish Jetley Jonathan Lifflander Phil
More informationThe Many-Core Revolution Understanding Change. Alejandro Cabrera January 29, 2009
The Many-Core Revolution Understanding Change Alejandro Cabrera cpp.cabrera@gmail.com January 29, 2009 Disclaimer This presentation currently contains several claims requiring proper citations and a few
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationParallel Combinatorial BLAS and Applications in Graph Computations
Parallel Combinatorial BLAS and Applications in Graph Computations Aydın Buluç John R. Gilbert University of California, Santa Barbara SIAM ANNUAL MEETING 2009 July 8, 2009 1 Primitives for Graph Computations
More informationThe Path to GPU as a Service in Kubernetes Renaud Gaubert Lead Kubernetes Engineer
The Path to GPU as a Service in Kubernetes Renaud Gaubert , Lead Kubernetes Engineer May 03, 2018 RUNNING A GPU APPLICATION Customers using DL DL Application RHEL 7.3 CUDA 8.0 Driver 375
More informationMatrix Multiplication
Systolic Arrays Matrix Multiplication Is used in many areas of signal processing, graph theory, graphics, machine learning, physics etc Implementation considerations are based on precision, storage and
More informationA Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video
A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video Amrita Mazumdar Armin Alaghi Jonathan T. Barron David Gallup Luis Ceze Mark Oskin Steven M. Seitz University of Washington Google
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationGREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer
GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES Nikolay Markovskiy Peter Messmer ABOUT CP2K Atomistic and molecular simulations of solid state From ab initio DFT and Hartree-Fock
More informationFatMan vs. LittleBoy: Scaling up Linear Algebraic Operations in Scale-out Data Platforms
FatMan vs. LittleBoy: Scaling up Linear Algebraic Operations in Scale-out Data Platforms Luna Xu (Virginia Tech) Seung-Hwan Lim (ORNL) Ali R. Butt (Virginia Tech) Sreenivas R. Sukumar (ORNL) Ramakrishnan
More informationDesigning Parallel Programs. This review was developed from Introduction to Parallel Computing
Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationThe future is parallel but it may not be easy
The future is parallel but it may not be easy Michael J. Flynn Maxeler and Stanford University M. J. Flynn 1 HiPC Dec 07 Outline I The big technology tradeoffs: area, time, power HPC: What s new at the
More informationParallel 3D Sweep Kernel with PaRSEC
Parallel 3D Sweep Kernel with PaRSEC Salli Moustafa Mathieu Faverge Laurent Plagne Pierre Ramet 1 st International Workshop on HPC-CFD in Energy/Transport Domains August 22, 2014 Overview 1. Cartesian
More informationBlocking Optimization Strategies for Sparse Tensor Computation
Blocking Optimization Strategies for Sparse Tensor Computation Jee Choi 1, Xing Liu 1, Shaden Smith 2, and Tyler Simon 3 1 IBM T. J. Watson Research, 2 University of Minnesota, 3 University of Maryland
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationPorting Scientific Applications to OpenPOWER
Porting Scientific Applications to OpenPOWER Dirk Pleiter Forschungszentrum Jülich / JSC #OpenPOWERSummit Join the conversation at #OpenPOWERSummit 1 JSC s HPC Strategy IBM Power 6 JUMP, 9 TFlop/s Intel
More informationParallel Programming Futures: What We Have and Will Have Will Not Be Enough
Parallel Programming Futures: What We Have and Will Have Will Not Be Enough Michael Heroux SOS 22 March 27, 2018 Sandia National Laboratories is a multimission laboratory managed and operated by National
More informationAdvanced Parallel Programming. Is there life beyond MPI?
Advanced Parallel Programming Is there life beyond MPI? Outline MPI vs. High Level Languages Declarative Languages Map Reduce and Hadoop Shared Global Address Space Languages Charm++ ChaNGa ChaNGa on GPUs
More informationA Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography
1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationUsing R for HPC Data Science. Session: Parallel Programming Paradigms. George Ostrouchov
Using R for HPC Data Science Session: Parallel Programming Paradigms George Ostrouchov Oak Ridge National Laboratory and University of Tennessee and pbdr Core Team Course at IT4Innovations, Ostrava, October
More informationHardware-Software Codesign
Hardware-Software Codesign 8. Performance Estimation Lothar Thiele 8-1 System Design specification system synthesis estimation -compilation intellectual prop. code instruction set HW-synthesis intellectual
More informationMay 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND. Mark Harris, May 10, 2017
May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND Mark Harris, May 10, 2017 INTRODUCING CUDA 9 BUILT FOR VOLTA FASTER LIBRARIES Tesla V100 New GPU Architecture Tensor Cores NVLink Independent Thread Scheduling
More informationIntel Array Building Blocks
Intel Array Building Blocks Productivity, Performance, and Portability with Intel Parallel Building Blocks Intel SW Products Workshop 2010 CERN openlab 11/29/2010 1 Agenda Legal Information Vision Call
More informationGTC 2017 S7672. OpenACC Best Practices: Accelerating the C++ NUMECA FINE/Open CFD Solver
David Gutzwiller, NUMECA USA (david.gutzwiller@numeca.com) Dr. Ravi Srinivasan, Dresser-Rand Alain Demeulenaere, NUMECA USA 5/9/2017 GTC 2017 S7672 OpenACC Best Practices: Accelerating the C++ NUMECA FINE/Open
More informationAccelerating the Fast Fourier Transform using Mixed Precision on Tensor Core Hardware
NSF REU - 2018: Project Report Accelerating the Fast Fourier Transform using Mixed Precision on Tensor Core Hardware Anumeena Sorna Electronics and Communciation Engineering National Institute of Technology,
More informationOptimizing and Accelerating Your MATLAB Code
Optimizing and Accelerating Your MATLAB Code Sofia Mosesson Senior Application Engineer 2016 The MathWorks, Inc. 1 Agenda Optimizing for loops and using vector and matrix operations Indexing in different
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationEfficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI
Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from
More informationHSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!
Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationCUDA Experiences: Over-Optimization and Future HPC
CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign
More information