Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework
1 Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework
Erik Schnetter, Perimeter Institute for Theoretical Physics
XSCALE 2013, Boulder, CO
3 Gamma-Ray Bursts
- Intense, narrowly-beamed flashes of high-energy photons
- Most energetic events in today's universe
- Mechanisms still a riddle (grand challenge in astrophysics)
- Gravitational waves likely to be detected by LIGO in coming years
- At the intersection of many different fields of physics
[Figure: stages of a gamma-ray burst in a massive star (~10^7 km, not drawn to scale): Iron Core Collapse; Protoneutron Star; Accretion; Collapse to a Black Hole; Accretion with Accretion Disk; Jet Formation and Sustainment; Jet Propagation / Breakout; Disruption of Star; Afterglow Emission. Shell labels: He, C, O/Ne/Mg, Si, Fe-group nuclei.]
4 Estimated Requirements to Model a Long GRB ab initio
Stage | Physics | Computation
Core collapse, Supernova | EOS, neutrino transport, MHD, GR | 25 m, 10,000 km; 1 ms, 2 s; AMR: 11 levels, 640M cells, 5M steps; 18 TByte; 270,000 PFlop (3 days)
Accretion, Jet formation | Neutrino transport, MHD, GR | 25 m, 1,000 km; 1 ms, 200 s; AMR: 10 levels, 80M cells, 600M steps; 3 TByte; 6,000,000 PFlop (70 days)
Break-out | Photon transport, MHD, coupled to accretion | 1,000 km, 1,000,000 km; 1 ms, 200 s; AMR: 15 levels, 100M cells, 6M steps; 25 TByte; 300,000 PFlop (3 days)
Afterglow | Photon transport/absorption, nuclear decay | Even larger / months; Monte Carlo????
6 30^3 = 27,000; (30+3+3)^3 = 46,656; 46,656 / 27,000 = x 1.7
7 Current Scalability Limitations
- Adaptive Mesh Refinement (AMR) serializes time evolution of different levels
- Higher-order methods require many ghost zones
- Communication becomes bandwidth limited
- Prefer large memory per node, many cores per node
- Happy about recent architecture development
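The cost of ghost zones can be made concrete with a small calculation. The sketch below assumes a cubic block with n interior points per side and g ghost zones on every face; the function name is illustrative and not part of Cactus.

```c
/* Ratio of stored points (interior plus ghost zones on both sides of
   each dimension) to interior points, for a cubic block with n^3
   interior points and g ghost zones per face. */
static double ghost_overhead(int n, int g)
{
    double interior = (double)n * n * n;
    double side = (double)(n + 2 * g);
    return side * side * side / interior;
}
```

For n = 30 and g = 3 this gives (36/30)^3 = 1.728, so roughly 70% more points are stored and communicated than are evolved.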
8 Current Performance Limitations
- Equations are complex, need many Flop per cell (e.g. 10 kFlop)
- Need large caches per core (both instruction and data)
- Okay, not so happy after all
- Question: Task-based multi-threading?
10 A Hard Problem
[Figure: framework diagram with the labels Physics Model (EDL); Kranc Code Generator; CaKernel Programming Abstractions; Cactus-Carpet Computational Infrastructure; Parallel Programming Environment; Results; Publication]
14 AMR Bounding Box Algebra
- Need to describe shapes of refined regions
- bbox: rectangular region of grid points
- bboxset: set of bboxes, arbitrary shape
- How to implement set operations efficiently? (intersection, union, ...)
- How to turn into list of bboxes?
[http:// science/article/pii/s ]
15 Derivative contains only key points of original shape
[Figure panels: x derivative; xy derivative]
16 Implementation
- To store bboxset, calculate derivative, and store result in tree structure
- To apply operations (union, intersection, ...), use sweeping algorithm:
  - Restore full bboxset on the fly
  - Apply operation
  - Calculate derivative again
- Complexity: O(N log N)
  - N: number of regions (not number of points!)
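The sweeping idea is easiest to see in one dimension. The following is a simplified sketch, not the Carpet implementation: it assumes a 1D point set stored as its "derivative", meaning the sorted, strictly increasing positions where membership toggles, and it uses flat arrays instead of a tree. All names are hypothetical.

```c
#include <stddef.h>

/* A 1D point set stored as its derivative: the sorted positions where
   membership toggles. {2,5, 8,10} stands for [2,5) and [8,10). */
enum set_op { OP_UNION, OP_INTERSECTION };

/* Sweep over both toggle lists from left to right, restoring the
   inside/outside state of each operand on the fly, and emit a toggle
   whenever the state of the result changes. out needs room for
   na + nb entries. Returns the number of toggles written. */
static size_t sweep_combine(const int *a, size_t na,
                            const int *b, size_t nb,
                            enum set_op op, int *out)
{
    size_t ia = 0, ib = 0, nout = 0;
    int in_a = 0, in_b = 0, in_out = 0;
    while (ia < na || ib < nb) {
        /* Next sweep position: the smaller of the pending toggles. */
        int pos;
        if (ib >= nb || (ia < na && a[ia] <= b[ib]))
            pos = a[ia];
        else
            pos = b[ib];
        if (ia < na && a[ia] == pos) { in_a = !in_a; ++ia; }
        if (ib < nb && b[ib] == pos) { in_b = !in_b; ++ib; }
        int now = (op == OP_UNION) ? (in_a || in_b) : (in_a && in_b);
        if (now != in_out) { out[nout++] = pos; in_out = now; }
    }
    return nout;
}
```

The work is proportional to the number of region boundaries, not the number of grid points, which is the property the slide emphasizes. In several dimensions the same sweep would be applied recursively, one dimension at a time.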
17 Weak scaling benchmarks (GR + hydro)
[Figure, two panels, each plotting time per grid point RHS [µs] against number of cores. Panel 1: Einstein Toolkit benchmark: TOV (unigrid), on Blue Waters, Hopper, Stampede, Vesta (BG/Q). Panel 2: Einstein Toolkit benchmark: TOV (9 levels), on Blue Waters, Stampede.]
18 Dynamic Loop Optimization
[Compare Kokkos presentation]
- Cactus contains a module LoopControl that provides an abstraction for traversing loops:
    CCTK_LOOP3(loopName, i,j,k, imin,jmin,kmin, imax,jmax,kmax)
      // loop body
    CCTK_ENDLOOP3(loopName)
- Equivalent to 3D nest of for loops
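To illustrate how a macro pair can stand in for a loop nest, here is a minimal sketch. The macros MY_LOOP3 and MY_ENDLOOP3 are hypothetical and are not the Cactus definitions; they assume exclusive upper bounds and i as the fastest-varying index, and they do none of the profiling or tuning that LoopControl performs.

```c
/* Hypothetical macros (not the Cactus ones): expand to a plain 3D
   loop nest over [imin,imax) x [jmin,jmax) x [kmin,kmax). */
#define MY_LOOP3(name, i, j, k, imin, jmin, kmin, imax, jmax, kmax) \
    for (int k = (kmin); k < (kmax); ++k) {                         \
        for (int j = (jmin); j < (jmax); ++j) {                     \
            for (int i = (imin); i < (imax); ++i) {
#define MY_ENDLOOP3(name) \
            }             \
        }                 \
    }

/* Sum of i + j + k over the box [0,ni) x [0,nj) x [0,nk). */
static long sum_indices(int ni, int nj, int nk)
{
    long sum = 0;
    MY_LOOP3(demo, i, j, k, 0, 0, 0, ni, nj, nk)
        sum += i + j + k;
    MY_ENDLOOP3(demo)
    return sum;
}
```

Because the loop body sits between two macros, the traversal order can later be changed inside the macros without touching user code.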
19 Dynamic Loop Optimization
- LoopControl exposes loop-level parallelism
- Hierarchy of loop traversal strategies:
  1. Coarse-grained multithreading (different caches)
  2. Iterating over tiles
  3. Iterating within tiles (each fits into cache)
  4. Fine-grained multithreading (aka SMT) (shared cache)
  5. (SIMD vectorization)
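Levels 2 and 3 of this hierarchy can be sketched as a tiled traversal. This is an illustrative example only, assuming a contiguous array with i as the fastest index and positive tile sizes ti, tj, tk; the function names are hypothetical, and threading (levels 1 and 4) is only indicated in comments.

```c
#include <stddef.h>

static int min_int(int x, int y) { return x < y ? x : y; }

/* Add 1 to every point of an ni x nj x nk array, visiting it tile by
   tile. Tile sizes must be positive. Returns the number of points
   visited. */
static long tiled_visit(double *a, int ni, int nj, int nk,
                        int ti, int tj, int tk)
{
    long visited = 0;
    /* Iterate over tiles; a threaded version would hand different
       tiles to different threads at this level. */
    for (int k0 = 0; k0 < nk; k0 += tk)
    for (int j0 = 0; j0 < nj; j0 += tj)
    for (int i0 = 0; i0 < ni; i0 += ti) {
        int k1 = min_int(k0 + tk, nk);
        int j1 = min_int(j0 + tj, nj);
        int i1 = min_int(i0 + ti, ni);
        /* Iterate within one tile; the innermost i loop is the
           natural candidate for SIMD vectorization. */
        for (int k = k0; k < k1; ++k)
        for (int j = j0; j < j1; ++j)
        for (int i = i0; i < i1; ++i) {
            a[((size_t)k * nj + j) * ni + i] += 1.0;
            ++visited;
        }
    }
    return visited;
}
```

The tile sizes are exactly the kind of run-time parameter that a dynamic optimizer can adjust, since the best values depend on cache size and on the loop body.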
20 Dynamic Loop Optimization
- LoopControl employs dynamic optimizations
- Loop traversal depends on run-time parameters (e.g. thread decomposition, tile sizes)
- Loop execution is profiled
- Parameter settings are automatically improved at run time: Random-Restart Hill Climbing
- Profiling and optimization transparent to user code
- Only minor source modification (replace for by CCTK_LOOP3)
- Implemented via macros
21 Random-Restart Hill Climbing
- For each loop setup, remember current best parameters
- Hill Climbing: examine neighbouring parameter choices
- Random Restart: from time to time, randomly choose new parameters, and backtrack if necessary
- Goal is to reduce run time, not to find best parameter choice!
  - Bad parameter choices can be 10x slower
  - Better to find and use mediocre parameters many times than a bad choice just a few times
- Typical improvements (for our code): 10% to 20%
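The strategy above can be sketched for a single integer parameter. This is a minimal illustration under stated assumptions, not the LoopControl code: the tuner struct, the function names, and the toy cost function are all hypothetical, and a real tuner would use measured loop run times as the cost.

```c
#include <stdlib.h>

/* Cost callback: in a real tuner this would be the measured run time
   of the loop for the given parameter (e.g. a tile size). */
typedef double (*cost_fn)(int param);

struct tuner {
    int best;         /* best parameter found so far */
    double best_cost; /* cost measured for it */
    int pmin, pmax;   /* inclusive range of allowed parameters */
};

/* Uniform integer in [0, n), using the high bits of rand(). */
static int rand_below(int n)
{
    return (int)((double)rand() / ((double)RAND_MAX + 1.0) * n);
}

static void tuner_init(struct tuner *t, int pmin, int pmax, int start,
                       cost_fn cost)
{
    t->pmin = pmin;
    t->pmax = pmax;
    t->best = start;
    t->best_cost = cost(start);
}

/* One tuning step; returns the parameter to keep using. */
static int tuner_step(struct tuner *t, cost_fn cost, double restart_prob)
{
    int cand;
    if (rand_below(1000) < (int)(restart_prob * 1000.0)) {
        /* Random restart: jump anywhere in the allowed range. */
        cand = t->pmin + rand_below(t->pmax - t->pmin + 1);
    } else {
        /* Hill climbing: try a neighbour of the current best. */
        cand = t->best + (rand_below(2) ? 1 : -1);
        if (cand < t->pmin) cand = t->pmin;
        if (cand > t->pmax) cand = t->pmax;
    }
    double c = cost(cand);
    if (c < t->best_cost) { /* keep an improvement */
        t->best = cand;
        t->best_cost = c;
    }                       /* otherwise backtrack: best is unchanged */
    return t->best;
}

/* Toy cost with its minimum at param == 7, standing in for timings. */
static double toy_cost(int param)
{
    double d = (double)(param - 7);
    return d * d + 1.0;
}
```

Each call tries one candidate and falls back to the remembered best if the candidate is worse, so a bad choice is paid for only once while a good one is reused on every later iteration.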
22 Dynamic Optimization vs. Auto-Tuning
- Auto-Tuning: Determine best loop parameters ahead of time
- Advantage: Actually finds best parameters
- Disadvantage: Specific to
  - machine
  - compiler version
  - code version
  - grid size
23 SIMD Vectorization
- SIMD vectorization is important
- Compilers often fail
  - To make them succeed, one needs to add annotations
  - To produce better code, one needs to add more annotations
  - However, some code properties currently cannot be described via annotations
- Need portable code: avoid or abstract architecture-dependent or compiler-dependent mechanisms
24 SIMD Intrinsics
Original loop:
    for (int i=0; i<n; ++i) {
      a[i] = b[i] * c[i] + d[i];
    }
SSE:
    #include <emmintrin.h>
    // 2 doubles per vector; _mm_load_pd/_mm_store_pd need 16-byte aligned addresses
    for (int i=0; i<n; i+=2) {
      __m128d ai, bi, ci, di;
      bi = _mm_load_pd(&b[i]);
      ci = _mm_load_pd(&c[i]);
      di = _mm_load_pd(&d[i]);
      ai = _mm_add_pd(_mm_mul_pd(bi, ci), di);  // b*c + d
      _mm_store_pd(&a[i], ai);
    }
QPX:
    #include <builtins.h>
    // 4 doubles per vector
    for (int i=0; i<n; i+=4) {
      vector4double ai, bi, ci, di;
      bi = vec_lda(0, &b[i]);
      ci = vec_lda(0, &c[i]);
      di = vec_lda(0, &d[i]);
      ai = vec_madd(bi, ci, di);  // fused b*c + d
      vec_sta(ai, 0, &a[i]);
    }
25 SIMD Vectorization API (for Stencil-Based Codes)
Implemented in Cactus in module Vectors
1. Data types for vectors of double, int, bool
2. Arithmetic operations (+ - * / ?: etc.)
3. Math functions (sqrt sin cos exp)
4. Memory load/store operations
   - Aligned/unaligned
   - Partial access (masks)
   - Cache bypass
5. Helper functions for iterating over arrays/stencil operations
   - Array index calculations
   - Mask generation
OpenCL, OpenMP
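The idea of hiding architecture-specific intrinsics behind one small API can be sketched as follows. This is not the Cactus Vectors module: the my_vec type and the my_vec_* functions are hypothetical names, only SSE2 and a scalar fallback are covered, and unaligned loads and stores are used so that no alignment has to be assumed.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
/* SSE2 backend: two doubles per vector. */
typedef __m128d my_vec;
#define MY_VEC_SIZE 2
static inline my_vec my_vec_loadu(const double *p) { return _mm_loadu_pd(p); }
static inline void my_vec_storeu(double *p, my_vec x) { _mm_storeu_pd(p, x); }
static inline my_vec my_vec_madd(my_vec x, my_vec y, my_vec z)
{
    return _mm_add_pd(_mm_mul_pd(x, y), z);
}
#else
/* Scalar fallback: a "vector" of one double. */
typedef double my_vec;
#define MY_VEC_SIZE 1
static inline my_vec my_vec_loadu(const double *p) { return *p; }
static inline void my_vec_storeu(double *p, my_vec x) { *p = x; }
static inline my_vec my_vec_madd(my_vec x, my_vec y, my_vec z)
{
    return x * y + z;
}
#endif

/* a[i] = b[i] * c[i] + d[i], written once against the wrapper. */
static void madd_arrays(double *a, const double *b, const double *c,
                        const double *d, size_t n)
{
    size_t i = 0;
    for (; i + MY_VEC_SIZE <= n; i += MY_VEC_SIZE)
        my_vec_storeu(&a[i], my_vec_madd(my_vec_loadu(&b[i]),
                                         my_vec_loadu(&c[i]),
                                         my_vec_loadu(&d[i])));
    for (; i < n; ++i) /* scalar remainder when n is not a multiple */
        a[i] = b[i] * c[i] + d[i];
}
```

The user-level loop is identical on every platform; supporting another instruction set means adding one more backend section rather than rewriting each kernel.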
26 Finite Difference Stencil Example
Original loop:
    for (int i=1; i<n-1; ++i) {
      a[i] = 0.5 * (b[i+1] - b[i-1]);
    }
Vectorized:
    #include <vectors.h>
    VEC_ITERATE(i, 1, N-1) {
      CCTK_REAL_VEC ai, bim, bip;
      bim = vec_loadu_off(-1, &b[i-1]);
      bip = vec_loadu_off(+1, &b[i+1]);
      ai = vec_mul(vec_set1(0.5), vec_sub(bip, bim));
      vec_store_nta_partial(&a[i], ai);
    }
- Code transformation is straightforward
- Can provide additional information to improve performance
27 Conclusion
- Automated code generation: Creating application modules from equations/stencil descriptions
- Cactus software framework for portability
- Three performance/optimization abstractions:
  - Efficient bounding box set algebra
  - Dynamic loop optimizations
  - Explicit vectorization API