The Many-Core Revolution: Understanding Change. Alejandro Cabrera. January 29, 2009
1 The Many-Core Revolution: Understanding Change. Alejandro Cabrera. January 29, 2009
2 Disclaimer: This presentation currently contains several claims requiring proper citations and a few images that may or may not be licensed under Creative Commons. (The dwarf twins later on are CC-compatible.) In short, it is not ready for production use. You've been warned.
3 Acknowledgements. Berkeley View: the bulk of this presentation is based on this paper. NVIDIA: their GPUs and spec sheets provide some exciting numbers. Google: image searching made easy. Tilera: a many-core CPU.
4 Overview: Exciting Pictures, Exciting Numbers; State of the Core; Why Many-Core?; Common Wisdoms Refuted; Dwarves and Applications; Programming Many-Core; Discussion
5 Many-Core CPU: Tilera Pictures
6 Many-Core CPU: Tilera Numbers. Memory Installed: 2.5 GB DDR2. Processor Clock: 700 MHz.
7 Many-Core CPU: Tilera Numbers. Not many numbers available yet... Memory Installed: 2.5 GB DDR2. Processor Clock: 700 MHz.
8 Many-Core GPU: Nvidia GTX 295 Pictures
9 Many-Core GPU: Nvidia GTX 295 Numbers. GPU Engine Specs: Cores 480, Graphics Clock 576 MHz, Processor Clock 1242 MHz, Texture Fill Rate 92.2 billion pixels/sec. Memory Specs: Memory Clock 999 MHz, Memory Config 1792 MB GDDR3, Memory Interface Width 896-bit, Memory Bandwidth 223.8 GB/s.
10 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec Typical CPU has no more than 4 cores!
11 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 92.2 billion pixels per second...
12 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 92.2 billion pixels per second... A high-end monitor has a resolution of 2560 x 1600
13 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 92.2 billion pixels per second... A high-end monitor has a resolution of 2560 x 1600 That's...
14 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 92.2 billion pixels per second... A high-end monitor has a resolution of 2560 x 1600 That's 4.1 million pixels
15 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 4.1 million << 92.2 billion You could re-draw an entire scene about 22,500 times per second!
16 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 4.1 million << 92.2 billion You could re-draw an entire scene about 22,500 times per second! (assumes trivial, flat scene)
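The slide's arithmetic checks out. Here is a minimal C sketch of it, assuming the full 92.2 Gpixel/s fill rate is usable, i.e. the trivial, flat scene the slide warns about:

```c
#include <stdio.h>

/* Back-of-the-envelope check of the fill-rate arithmetic on slides 13-16. */
int main(void) {
    double fill_rate = 92.2e9;          /* pixels per second */
    double pixels = 2560.0 * 1600.0;    /* one high-end frame */
    printf("pixels per frame: %.1f million\n", pixels / 1e6);  /* ~4.1 */
    printf("redraws per second: %.0f\n", fill_rate / pixels);  /* ~22,500 */
    return 0;
}
```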
17 Many-Core GPU: Nvidia GTX 295 Numbers. 223.8 GB/s. Memory Specs: Memory Clock 999 MHz, Memory Config 1792 MB GDDR3, Memory Interface Width 896-bit, Memory Bandwidth 223.8 GB/s.
18 Many-Core GPU: Nvidia GTX 295 Numbers. 223.8 GB/s: that's ~1 TB every 4 seconds. Memory Specs: Memory Clock 999 MHz, Memory Config 1792 MB GDDR3, Memory Interface Width 896-bit, Memory Bandwidth 223.8 GB/s.
19 Many-Core GPU: Nvidia GTX 295 Numbers. 223.8 GB/s: ~1 TB every 4 seconds. Let's picture it. Memory Specs: Memory Clock 999 MHz, Memory Config 1792 MB GDDR3, Memory Interface Width 896-bit, Memory Bandwidth 223.8 GB/s.
20 Many-Core GPU: Nvidia GTX 295 Numbers. WARNING: The following slides feature assumptions that have no basis in reality. *A disk cannot store data at 223.8 GB/s (yet).
21 Many-Core GPU: Nvidia GTX 295. 4 seconds of data processing: ~1 TB.
22 Many-Core GPU: Nvidia GTX 295. 1 minute of data processing.
23 Many-Core GPU: Nvidia GTX 295. 1 hour of data processing: a 3.5-inch drive's form factor is 4 x 5.75 x 1 = 23 in³, and 15 x 60 = 900 disks are filled in one hour. 23 x 15 x 60 = 20,700 in ≈ 1,725 ft of stacked disks. Height of the world's tallest building: 1,730 ft. To fit all the data would take a stack of disks as tall as the Sears (Willis) Tower!
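The same thought experiment as a minimal C sketch, assuming the full 223.8 GB/s is sustained and a hypothetical 1 TB disk as the unit of storage (the slides round 13.4 disks per minute up to 15):

```c
#include <stdio.h>

/* The bandwidth arithmetic behind the "picture it" slides. */
int main(void) {
    double bw = 223.8e9;   /* bytes per second */
    double tb = 1e12;      /* one (decimal) terabyte */
    printf("GB moved in 4 seconds: %.0f\n", 4.0 * bw / 1e9);    /* ~895, i.e. ~1 TB */
    printf("seconds to fill one 1 TB disk: %.1f\n", tb / bw);   /* ~4.5 */
    printf("disks filled per hour: %.0f\n", 3600.0 * bw / tb);  /* ~806 */
    return 0;
}
```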
24 State of the Core: Where Are We Now?
- Sequential processors aren't getting any faster: no more free lunch via the Moore's Law voucher.
- Quad-core = commodity.
- Parallel applications are few and far between in consumer markets. Look instead to scientific and enterprise computing, or to the multi-process Google Chrome browser.
- Vendors want more performance, and they want it yesterday. It's often a selling point.
25 State of the Core: Where Are We Now?
- Graphics processors no longer take the back seat: GPGPU (2002), CUDA v1.1 (2007), CUDA v3.0b (2010).
- Powerful accelerators are mingling with the CPU: IBM Cell.
26 State of the Core: Where Are We Now? Biggest problem: how do we develop efficient, correct, scalable parallel components, and how do we develop highly parallel applications composed of those components? (Components: data structures and algorithms.)
27 Why Many-Core? Beyond Exciting Numbers It's not a victory parade towards a bright, new idea. It's a retreat from an even greater challenge. We can't make sequential processors faster without melting them (or exploding our energy bills). We still want to get faster as quickly as possible, so we pursue the most immediate solution towards that end. As a result, many of our conventional wisdoms acquired over previous decades of computing have been overturned.
28 Common Wisdoms (Refuted): Power vs. Transistors. Old Wisdom: Power is free, but transistors are expensive. New Wisdom: Power is expensive, but transistors are free. We can fit more transistors on a chip than we have the power to turn on!
29 Common Wisdoms (Refuted): Dynamic vs. Static Power. Old Wisdom: You should only worry about dynamic power (voltage scaling). New Wisdom: For desktops and servers, static power leakage can be 40% of total power.
30 Common Wisdoms (Refuted): Hardware Errors. Old Wisdom: Uniprocessors are reliable internally. New Wisdom: With transistor designs falling below the 65 nm scale, errors occur at the quantum level.
31 Common Wisdoms (Refuted): Scaling Designs. Old Wisdom: Old successes guide future successes, so we need only build upon prior designs. New Wisdom: As design sizes (nanometers) drop, a multitude of factors will stretch development time.
32 Common Wisdoms (Refuted): Architecture Research. Old Wisdom: Let academia evaluate experimental designs; they can build their own chips. New Wisdom: Academia can no longer afford the tools required to build believable chips.
33 Common Wisdoms (Refuted): Bandwidth vs. Latency. Old Wisdom: Performance improves as latency drops and bandwidth increases. New Wisdom: Bandwidth improves exponentially compared to latency (the memory wall).
34 Common Wisdoms (Refuted): Computation vs. Memory Access. Old Wisdom: Store common computations in tables; arithmetic is slow. New Wisdom: Re-compute needed results; data storage is slow.
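To make the flipped wisdom concrete, here is a hypothetical sine lookup table versus direct recomputation in C. The function names and table size are illustrative assumptions; this is a sketch of the two styles, not a benchmark:

```c
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 4096
static double table[N];

static void build_table(void) {
    for (int i = 0; i < N; i++)
        table[i] = sin(2.0 * M_PI * i / N);
}

/* Old wisdom: one table read per call; a large table is likely to miss cache. */
static double sin_lookup(double x) {
    int i = (int)(x / (2.0 * M_PI) * N) % N;
    return table[i];
}

/* New wisdom: a few FLOPs per call, staying in registers and the ALU. */
static double sin_compute(double x) {
    return sin(x);
}

int main(void) {
    build_table();
    printf("lookup: %f  compute: %f\n", sin_lookup(1.0), sin_compute(1.0));
    return 0;
}
```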
35 Common Wisdoms (Refuted): Instruction-Level Parallelism. Old Wisdom: There's an abundance of ILP waiting to be found by compilers, architecture designs, VLIW... New Wisdom: Diminishing returns on ILP.
36 Common Wisdoms (Refuted): Moore's Law. Old Wisdom: Performance doubles every 18 months on a uniprocessor. New Wisdom: Power Wall + Memory Wall + ILP Wall: it now takes greater than 60 months to maintain a Moore's Law doubling on uniprocessors.
37 Common Wisdoms (Refuted): Why Parallelize? Old Wisdom: Don't bother parallelizing; Moore's Law promises it'll run faster in a couple of years, unmodified. New Wisdom: It'll be a long time before an unmodified program gets faster.
38 Common Wisdoms (Refuted): Parallel Performance Value. Old Wisdom: If it doesn't scale linearly, trash it. New Wisdom: Any performance scaling is better than none; it's the only way to get faster now!
39 Common Wisdoms: Moore's Law II. New Wisdom: The number of cores available on a chip doubles every 18 months.
40 Classifying Parallelism: Dwarves. To better understand parallel applications, a series of areas where parallelism is commonly exploited was analyzed. These areas are dwarves: patterns of communication and computation that identify a category of application.
41 Classifying Parallelism The Seven Dwarves
42 The Seven Dwarves
- Dense Linear Algebra (BLAS)
- Sparse Linear Algebra (conjugate gradient)
- Spectral Methods (FFT, DSP)
- N-Body Simulation
- Structured Grid (PDE solver)
- Unstructured Grid
- Monte Carlo (embarrassingly parallel. Why? Independent events; see the sketch below)
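Why do independent events make Monte Carlo embarrassingly parallel? Because no sample needs data from any other, threads only meet at the final reduction. A minimal OpenMP sketch in C that estimates pi; the use of POSIX rand_r and the per-thread seeding are illustrative assumptions (compile with, e.g., gcc -fopenmp):

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long samples = 10000000;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        /* Each thread gets its own RNG state: no shared data, no communication. */
        unsigned int seed = 1234u + omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < samples; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;   /* point inside quarter circle */
        }
    }
    printf("pi ~= %f\n", 4.0 * hits / samples);
    return 0;
}
```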
43 The Six (Other) Dwarves. Why more dwarves? Up-and-coming algorithmic techniques and application domains require parallelization. Observed domains:
- Combinatorial logic (SHA, MD5, AES, cryptography)
- Graph traversal (BFS, A*, maximum network flow)
- Finite state machines
- Bayesian networks, Hidden Markov Models
- Machine learning
- Dynamic programming
- Backtrack, branch-and-bound
44 The Six (Other) Dwarves. In particular, no approach is currently known for parallel evaluation of a finite state machine. How can a system be in multiple states at once? It is thought to be embarrassingly sequential. This brings us to our crux: how do we develop new parallel applications?
45 Programming Many-Core. We have exciting equipment: GPUs, CPUs, accelerators... We have clear application areas: linear algebra, graph traversal, dynamic programming... How do we use those exciting machines to satisfy those application requirements?
46 Programming Many-Core. #include <???> import ??? from
47 Programming Many-Core. #include <???> import ??? from How did I get here...?
48 Programming Many-Core. #include <???> import ??? from and who is that behind me?
49 Parallel Programming Models. There are many (more) details that may need to be managed to produce an efficient parallel application than a sequential one. Programming models seek to simplify the management of one or more of the following: task identification, task mapping, data distribution, communication mapping, synchronization.
50 Parallel Programming Models: A Few Existing Examples. MPI, Pthreads, MapReduce, OpenMP, CUDA, OpenCL.
51 Parallel Programming Models: A Few Existing Examples

Model     | Task ID  | Task Mapping | Data Distrib. | Comm. Mapping | Sync
----------|----------|--------------|---------------|---------------|---------
MPI       | explicit | explicit     | explicit      | implicit      | implicit
Pthreads  | explicit | explicit     | implicit      | implicit      | explicit
MapReduce | explicit | implicit     | implicit      | implicit      | explicit
OpenMP    | implicit | implicit     | implicit      | implicit      | implicit
CUDA      | explicit | explicit     | explicit      | implicit      | explicit
OpenCL    | N/A      | N/A          | N/A           | N/A           | N/A
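The OpenMP row of the table, in code: task identification, task mapping, data distribution, and synchronization are all left implicit in a single pragma. A minimal sketch of a parallel vector sum (compile with, e.g., gcc -fopenmp):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* The runtime decides how to split, schedule, and synchronize the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```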
52 Parallel Programming Models
Message passing:
- Pros: harder to make mistakes; much upfront planning; highly verifiable model.
- Cons: difficult to learn; hardware not yet widely thought of as networked components.
Shared memory:
- Pros: friendly learning curve; matches hardware layout.
- Cons: may not scale (cache coherence...); easy to make mistakes.
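By contrast, a minimal MPI sketch of the message-passing style: task identification, mapping, and data distribution are explicit, which means more upfront planning but fewer hidden sharing bugs. The strided work division here is an illustrative choice (build and run with, e.g., mpicc then mpirun -np 4):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Explicit data distribution: each rank claims a strided share of the work. */
    long local = 0, total = 0;
    for (long i = rank; i < 1000000; i += size)
        local += 1;

    /* Explicit communication: rank 0 gathers the partial results. */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %ld across %d ranks\n", total, size);
    MPI_Finalize();
    return 0;
}
```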
53 Looking to the Many-Core Future. We need:
- Better parallel programming models
- Better compilers
- A set of primitives to build APIs from
- Thoroughly tested parallel components: libraries and tools
54 Looking to the Many-Core Future. Many challenges await. However, thanks to the revised Moore's Law, we have much to look forward to. Now, more than ever before, we'll be able to make great advances in sciences that depend on extensive computation. All we have to do is add another core.
56 We'll yet get to the core of this issue.