SUPPORT FOR ADVANCED COMPUTING IN PROJECT CHRONO
1 SUPPORT FOR ADVANCED COMPUTING IN PROJECT CHRONO
Dan Negrut, Vilas Associate Professor, NVIDIA CUDA Fellow
Simulation-Based Engineering Lab, University of Wisconsin-Madison
December 9, 2015
2 Acknowledgement
- Funding for Project Chrono comes from US Army TARDEC
- Ends in September 2016 [looking for organizations to partner with for transfer of technology and joint projects]
3 Overview
- Part 1: discussion of two trends in computing
- Part 2: how we position Chrono to accommodate future trends in computing
4 The Price of 1 Gflop/second
- 1961: Combine 17 million IBM-1620 computers. At $64K apiece, when adjusted for inflation, this would cost $8.3 trillion
- 2000: About $1, : 8 cents
[wikipedia]
5 "The inside of a computer is as dumb as hell but it goes like mad." --Richard Feynman
6 Adopting a Positive Outlook
"The inside of a computer goes like mad but needs some hand holding."
7 First Trend Discussed Here: Memory Speed
- 3D Memory: a major breakthrough
8 Basic Fact, Speed of Execution: Math Doesn't Matter
The function as written:

    #include <cmath>

    void somefunction(double* a, double* b, unsigned int arrsize) {
        double dummy[3];
        dummy[0] = sin(a[1]);
        dummy[1] = log(fabs(a[2])) + sqrt(2. + dummy[0]);
        dummy[2] = cos(b[1]) + exp(b[0]);
        a[0] = dummy[1];
        b[0] = dummy[2];
        // and so on...
    }

The same function, reduced to the data each line has to fetch:

    void somefunction(double* a, double* b, unsigned int arrsize) {
        double dummy[3];
        // dummy[0] <- a[1]
        // dummy[1] <- a[2]
        // dummy[2] <- b[1] and b[0]
        // and so on...
    }
9 Why Math Operations Don't Count
- Memory speed almost always dictates the performance of a computation
- One transaction to GPU global memory: 400 clock cycles
- 32 fused multiply-add operations (c = α c + b), i.e., 64 operations: 1 clock cycle
- Bottom line: it is 100X more expensive to move data to where it is needed
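The cycle counts on this slide can be turned into a back-of-the-envelope cost model. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes memory transactions are fully serialized and never overlap with arithmetic, which real GPUs try hard to avoid. The two constants are the ones quoted on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted on the slide.
const std::uint64_t kCyclesPerTransaction = 400;  // one global-memory transaction
const std::uint64_t kFmaPerCycle = 32;            // fused multiply-adds per clock cycle

// Cycles spent on memory traffic, assuming serialized transactions
// (a pessimistic simplification made for this sketch).
std::uint64_t memory_cycles(std::uint64_t transactions) {
    return transactions * kCyclesPerTransaction;
}

// Cycles spent on arithmetic, rounded up to whole cycles.
std::uint64_t compute_cycles(std::uint64_t fma_count) {
    return (fma_count + kFmaPerCycle - 1) / kFmaPerCycle;
}

// How many times longer the memory traffic takes than the arithmetic.
double memory_to_compute_ratio(std::uint64_t transactions, std::uint64_t fma_count) {
    assert(fma_count > 0);
    return static_cast<double>(memory_cycles(transactions)) /
           static_cast<double>(compute_cycles(fma_count));
}
```

Under these assumptions, `memory_to_compute_ratio(1, 32)` evaluates to 400: a single transaction costs as much as 400 cycles of full-rate arithmetic. The slide summarizes the penalty as roughly 100X.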
10 Memory Speed: Hard Nut to Crack
- Historically, memory speed has increased at a rate of approx. 1.07/year
- Historically, processors improved at faster rates
  - 1.25/year ( )
  - 1.52/year ( )
  - 1.20/year ( )
- Growing gap between memory speed and processing speed
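To see how quickly these yearly rates pull apart, the compounding can be written out. This is a minimal sketch with hypothetical function names; it assumes both components start from parity and that each rate stays constant over the period, which the slide does not claim.

```cpp
#include <cassert>
#include <cmath>

// Relative improvement after `years` of compounding at `rate_per_year`
// (for example, 1.07 means 7% faster every year).
double compounded(double rate_per_year, int years) {
    assert(rate_per_year > 0.0 && years >= 0);
    return std::pow(rate_per_year, years);
}

// Processor-to-memory speed gap after `years`, both starting from parity.
double speed_gap(double processor_rate, double memory_rate, int years) {
    return compounded(processor_rate, years) / compounded(memory_rate, years);
}
```

For example, `speed_gap(1.25, 1.07, 10)` is roughly 4.7: ten years at the slowest processor rate listed above already leaves memory almost five times further behind.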
11 Memory Speed: Widening of the Processor-DRAM Performance Gap
[Figure courtesy of Elsevier: Computer Architecture, Hennessy and Patterson, fourth edition]
12 3D Stacked Memory [future looks quite bright]
- SK Hynix's High Bandwidth Memory (HBM), developed by AMD and SK Hynix
- 1st generation (HBM1) introduced in AMD Fiji GPUs
  - 1 GB & 128 GB/s per stack
  - AMD Radeon R9 Fury X had four stacks: 4 GB & 512 GB/s [AMD]
- 2nd generation (HBM2) will be used in NVIDIA Pascal and AMD Arctic Islands GPUs
  - 2 GB & 256 GB/s bandwidth per stack
  - NVIDIA Pascal reported to have 1 TB/s memory bandwidth
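The per-stack and per-card figures above are related by simple multiplication, which the following sketch spells out. The function names are hypothetical, and the model assumes that stacks add up linearly and that the quoted peak bandwidth is actually sustained.

```cpp
#include <cassert>

// Aggregate bandwidth (GB/s) of a package carrying `stacks` memory stacks.
double aggregate_bandwidth_gbs(int stacks, double per_stack_gbs) {
    assert(stacks > 0 && per_stack_gbs > 0.0);
    return stacks * per_stack_gbs;
}

// Seconds needed to stream `gigabytes` of data once at `bandwidth_gbs`.
double stream_time_s(double gigabytes, double bandwidth_gbs) {
    assert(bandwidth_gbs > 0.0);
    return gigabytes / bandwidth_gbs;
}
```

With the Fury X numbers, `aggregate_bandwidth_gbs(4, 128.0)` gives 512 GB/s, and `stream_time_s(4.0, 512.0)` shows that reading the full 4 GB once takes about 8 milliseconds at peak rate.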
13 3D Stacked Memory [leap in technology]
[Figure: OLD vs. EMERGING memory layout] [AMD]
14 3D Stacked Memory
- More power efficient
  - Electrons move shorter distances
  - Less power wasted moving data
- Smaller memory footprint
  - More memory can be packed into the same space
[AMD]
16 Memory Speeds in a CPU-GPU System
[Figure: bandwidth and latency of the links between CPU core, cache, system memory, GPU cores, and GPU memory, plus Infiniband to the next node (6 GB/s, high latency). Other legible values: 80 GB/s and 8-16 GB/s; latencies are labeled low, relatively low, medium, and high.]
17 Second Trend Discussed Here: Moore's Law
- Number of transistors per unit area has been steadily going up
- ILP and clock speed have stagnated
18 Intel Roadmap, and Relevance to Us
nm Tick: Ivy Bridge Tock: Haswell
nm Tick: Broadwell Tock: Skylake
nm Refresh Kaby Lake
nm Tick: Cannonlake (delayed to 2nd half of 2017)
nm
nm 2023??? (carbon nanotubes?)
Happening now: Moore's law moving from month cycle to month cycle for the first time in 50 years
19 Transistor Densities Still Going Up
- Although not as fast as before, transistor densities are still going up
- Consequence: lots of cores in one chip
  - CPU cores: 18 today, probably 32 in two to three years
  - GPU scalar processors: 3,000 today (Maxwell), probably 4,500 in two years (Pascal)
  - Intel Xeon Phi: 61 today, very likely close to 200 in two years
20 Parallel Computing: Some Black Spots
- More transistors = more computational units
- 2015 vintage: 18-core Xeon Haswell-EX E V3, 5.6 billion transistors ($7200)
- Black silicon (more commonly called "dark silicon"): owing to high density and power leaks, these chips cannot be fully powered
- Black silicon: transistors that today don't get used and are dead weight
- Dennard's scaling started to break down at the end of last decade
- Dennard's law is the secret sauce for Moore's law
21 Lots of Cores: There's More Than Meets the Eye
- Solutions rarely scale beyond 20 cores when using shared memory
  - Cache coherence and NUMA slow things down
- I have 32 cores and only see the net effect of 20 of them???
- What if I have one workstation with four sockets, i.e., 128 cores: can I only scale up to 20???
- It looks like shared memory solutions don't scale well
22 Scaling, with Lots of Cores: Via Distributed Memory
- To scale, a different parallel programming paradigm is needed: distributed memory
  - Distributed memory eliminates cache coherence issues
  - Also good since you can solve very large problems: lots of memory available to the user
- Why not always do this?
  - A distributed memory solution calls for a major code rewrite
  - If not implemented well, a distributed memory solution has high data access latencies: 1000X higher than accessing memory on a workstation
23 Distributed Memory, Good for Long Run Though
- Five to six years from now, it's not clear what will replace Moore's law
  - No technology yet to continue the past steady increase in core count
  - Can no longer improve speed by using more cores on one workstation
- Distributed memory is the path towards running by drawing on multiple workstations, called nodes
24 Project Chrono Goal Solve one billion degrees of freedom by the time we get together in December
25 Positioning Chrono for Advanced Computing
- Cluster: group of nodes communicating through fast interconnect
- Node: group of processors communicating through shared memory
- Coprocessors/Accelerators: special compute devices attached to the local node through special interconnect
- Socket: group of cores communicating through shared cache
- Core: group of functional units communicating through registers
- Hyper-Threads: group of thread contexts sharing functional units
- Superscalar: group of instructions sharing functional units
- Pipeline: sequence of instructions sharing functional units
- Vector: single instruction using multiple functional units
[Intel]
We have full control / We have little to no control
26 HPC in Computational Dynamics: Is MPI the Way to Go?
- Applications are getting more sophisticated
  - Multi-scale, multi-module, multi-physics
- The traditional approach based on MPI is not attractive
  - Working on 100s of nodes is run of the mill in HPC
  - Load imbalance emerges as a big issue for some apps
31 MPI or Charm++?
- Charm++ is a generalized approach to writing parallel programs
  - An alternative to the likes of MPI, Chapel, UPC, etc.
- Charm++, three facets
  - A style of writing parallel programs
  - An ecosystem that facilitates the act of writing parallel programs
    - Debugger, profiler, ability to define own load balancing, etc.
  - A runtime system
32 Charm++ Attribute: Overdecomposition
- Decompose the work units & data units into many more pieces than execution units
  - Cores/Nodes/..
- Why do this?
  - Central idea: oversubscription of the hardware
  - Hide memory latency w/ useful execution
  - This oversubscription idea is a general tenet
    - Done by the GPU
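One consequence of overdecomposition that is easy to demonstrate without any runtime system is better load balance, the issue raised earlier for MPI codes. The sketch below is not Charm++ code and does not model latency hiding: it is a hypothetical greedy scheduler that hands each piece of work to the currently least-loaded execution unit, and all names in it are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Greedy list scheduling: hand each piece of work, in order, to the
// execution unit that is currently least loaded. Returns the finishing
// time of the busiest unit (the makespan).
double makespan(const std::vector<double>& piece_costs, int units) {
    assert(units > 0);
    std::vector<double> load(units, 0.0);
    for (double cost : piece_costs) {
        *std::min_element(load.begin(), load.end()) += cost;
    }
    return *std::max_element(load.begin(), load.end());
}

// Split every coarse piece into `factor` equal sub-pieces (overdecomposition).
std::vector<double> overdecompose(const std::vector<double>& coarse, int factor) {
    assert(factor > 0);
    std::vector<double> fine;
    fine.reserve(coarse.size() * factor);
    for (double cost : coarse) {
        for (int i = 0; i < factor; ++i) {
            fine.push_back(cost / factor);
        }
    }
    return fine;
}
```

With four uneven pieces costing 8, 4, 2, and 2 on four units, one piece per unit finishes at time 8. Splitting each piece into four and rescheduling finishes at time 4, the ideal for 16 units of total work.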
33 Charm++ Attribute: Migratability
- Make the work and data units on the previous slide migratable at runtime
  - That is, the programmer or runtime can move them from one execution unit (PE, for processing element) to another
- Consequences for the app developer
  - Communication must now be addressed to logical units with global names, not to physical processors
  - But this is a good thing
- Consequences for the runtime system (RTS)
  - Must keep track of where each unit is
  - Naming and location management
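The naming and location management duty of the runtime can be pictured as a lookup table from global names to PEs. The class below is a hypothetical, single-process sketch meant only to illustrate the idea; it is not the Charm++ API, and a real runtime would distribute and cache this table across nodes.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical, minimal location manager: maps the global name of a
// migratable unit to the processing element (PE) that currently hosts it.
class LocationManager {
public:
    // Register a new unit on a PE.
    void place(const std::string& unit, int pe) { location_[unit] = pe; }

    // Move an existing unit to another PE. Senders are unaffected because
    // they address the unit by name, not by PE.
    void migrate(const std::string& unit, int new_pe) {
        assert(location_.count(unit) == 1);
        location_[unit] = new_pe;
    }

    // Resolve a logical name to the PE a message should be routed to.
    int lookup(const std::string& unit) const {
        auto it = location_.find(unit);
        assert(it != location_.end());
        return it->second;
    }

private:
    std::unordered_map<std::string, int> location_;
};
```

A sender that always calls `lookup("terrain_block_7")` before sending keeps working unchanged after the runtime calls `migrate("terrain_block_7", 3)`, which is why addressing logical units is "a good thing" for the application developer.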
36 HMMWV on Deformable Terrain. Year:
37 Chrono GPU: HMMWV on Deformable Terrain. Year:
38 HMMWV on Discrete Terrain
CPU run:
- k rigid spheres
- Length of simulation: 15 seconds
- Hardware used: CPU (Intel), multicore, based on OpenMP
- Integration time step: 0.001 s
- Velocity Based Complementarity
- 17 seconds per time step
- Simulation time: ~2.5 days
GPU run (2015):
- ~1.5 million rigid spheres
- Length of simulation: 15 seconds
- Hardware: GPU (NVIDIA) Tesla K40X
- Integration time step: s
- Position Based Dynamics
- 0.3 seconds per time step
- Simulation time: ~2.5 hours
2015 simulation: although it has 5X more bodies, it runs about 25 times faster
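The wall-clock figures on this slide follow from the number of integration steps times the cost per step. The sketch below shows that arithmetic; the function names are hypothetical, and it assumes a fixed time step and a constant cost per step over the whole run.

```cpp
#include <cassert>
#include <cmath>

// Number of integration steps needed to cover `sim_length_s` seconds of
// simulated time with a fixed step `dt_s`.
long step_count(double sim_length_s, double dt_s) {
    assert(sim_length_s > 0.0 && dt_s > 0.0);
    return std::lround(sim_length_s / dt_s);
}

// Wall-clock seconds for a run, given the cost of one step.
double wall_clock_s(long steps, double seconds_per_step) {
    return steps * seconds_per_step;
}

// Speedup of run B over run A, computed from their total wall-clock times.
double speedup(double wall_a_s, double wall_b_s) {
    assert(wall_b_s > 0.0);
    return wall_a_s / wall_b_s;
}
```

For the CPU run, 15 s of simulation at a 0.001 s step is 15,000 steps, and at 17 s per step that comes to 255,000 s, just under three days and of the same order as the quoted ~2.5 days. The ratio of the quoted totals, 2.5 days to 2.5 hours, is 24, in line with "about 25 times faster".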
40 4 wide add operation (SSE 1.0)
C++ code:

    #include <xmmintrin.h>

    __m128 Add(const __m128 &x, const __m128 &y) {
        return _mm_add_ps(x, y);   // four single-precision adds at once
    }

    __m128 z, x, y;
    x = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    y = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    z = Add(x, y);

Lane by lane:

    x:  x3  x2  x1  x0
    y:  y3  y2  y1  y0
    z:  z3  z2  z1  z0   (z_i = x_i + y_i)

Assembly, from gcc -S -O3 sse_example.cpp:

    Z10AddRKDv4_fS1_:
        movaps (%rsi), %xmm0    # move y into SSE register xmm0
        addps  (%rdi), %xmm0    # add x to y and store the sum in xmm0
        ret                     # xmm0 is returned as result

[Hammad]
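To inspect the result of the four-wide add from ordinary code, the register has to be stored back to memory. The following is a self-contained sketch for x86 targets with SSE. `add_example` is a hypothetical helper name, while `_mm_set_ps`, `_mm_add_ps`, and `_mm_storeu_ps` are standard intrinsics from `<xmmintrin.h>`.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...

// Add two groups of four floats with a single SSE instruction.
__m128 Add(const __m128& x, const __m128& y) {
    return _mm_add_ps(x, y);
}

// Run the slide's example and copy the four lanes of z into `out`.
void add_example(float out[4]) {
    assert(out != nullptr);
    // _mm_set_ps takes its arguments from the highest lane down to lane 0.
    const __m128 x = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    const __m128 y = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    const __m128 z = Add(x, y);
    _mm_storeu_ps(out, z);  // unaligned store, safe for any float[4]
}
```

Because the two inputs are mirror images of each other, every lane of the result is 5.0f, which makes the example easy to check by hand.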
41 Conclusions, Chrono::HPC
- Moore's law reaching terminus in six years: distributed memory solutions are all we have left
- Looking at all of the above opportunities to speed up large simulations in Chrono
- Large simulations in Chrono: billion degree of freedom dynamic systems
  - Fluid-solid interaction
  - Granular material (high/low saturation)
  - Large nonlinear FEA
- Aiming to present a billion DOF simulation in Chrono at the Fall 2016 MaGIC meeting
42 Thank You.
More informationRevisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison
Revisiting the Past 25 Years: Lessons for the Future Guri Sohi University of Wisconsin-Madison Outline VLIW OOO Superscalar Enhancing Superscalar And the future 2 Beyond pipelining to ILP Late 1980s to
More informationTop500 Supercomputer list
Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationMICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE
MICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE LEVERAGE OUR EXPERTISE sales@microway.com http://microway.com/tesla NUMBERSMASHER TESLA 4-GPU SERVER/WORKSTATION Flexible form factor 4 PCI-E GPUs + 3 additional
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationPRACE Autumn School Basic Programming Models
PRACE Autumn School 2010 Basic Programming Models Basic Programming Models - Outline Introduction Key concepts Architectures Programming models Programming languages Compilers Operating system & libraries
More informationAdministration. Coursework. Prerequisites. CS 378: Programming for Performance. 4 or 5 programming projects
CS 378: Programming for Performance Administration Instructors: Keshav Pingali (Professor, CS department & ICES) 4.126 ACES Email: pingali@cs.utexas.edu TA: Hao Wu (Grad student, CS department) Email:
More informationIt's the end of the world as we know it
It's the end of the world as we know it Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Background Graduated as Valedictorian in Computer Science from Cardiff University
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationManycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.
phi 1 Manycore Processors phi 1 Definition Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. Manycore Accelerator: [Definition only for this
More informationMaximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs
Presented at the 2014 ANSYS Regional Conference- Detroit, June 5, 2014 Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Bhushan Desam, Ph.D. NVIDIA Corporation 1 NVIDIA Enterprise
More informationECE 2162 Intro & Trends. Jun Yang Fall 2009
ECE 2162 Intro & Trends Jun Yang Fall 2009 Prerequisites CoE/ECE 0142: Computer Organization; or CoE/CS 1541: Introduction to Computer Architecture I will assume you have detailed knowledge of Pipelining
More informationComputer Architecture
Informatics 3 Computer Architecture Dr. Vijay Nagarajan Institute for Computing Systems Architecture, School of Informatics University of Edinburgh (thanks to Prof. Nigel Topham) General Information Instructor
More informationBig Data Systems on Future Hardware. Bingsheng He NUS Computing
Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big
More informationParallel Programming on Ranger and Stampede
Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE
More informationCIT 668: System Architecture
CIT 668: System Architecture Computer Systems Architecture I 1. System Components 2. Processor 3. Memory 4. Storage 5. Network 6. Operating System Topics Images courtesy of Majd F. Sakr or from Wikipedia
More informationExascale: challenges and opportunities in a power constrained world
Exascale: challenges and opportunities in a power constrained world Carlo Cavazzoni c.cavazzoni@cineca.it SuperComputing Applications and Innovation Department CINECA CINECA non profit Consortium, made
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationMulti-Core Microprocessor Chips: Motivation & Challenges
Multi-Core Microprocessor Chips: Motivation & Challenges Dileep Bhandarkar, Ph. D. Architect at Large DEG Architecture & Planning Digital Enterprise Group Intel Corporation October 2005 Copyright 2005
More informationrepresent parallel computers, so distributed systems such as Does not consider storage or I/O issues
Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More information