The Many-Core Revolution Understanding Change. Alejandro Cabrera January 29, 2009


1 The Many-Core Revolution Understanding Change Alejandro Cabrera January 29, 2009

2 Disclaimer This presentation currently contains several claims requiring proper citations and a few images that may or may not be licensed under the Creative Commons. The dwarf twins later on are CC-compatible. In short, it is not ready for production use. You've been warned.

3 Acknowledgements Berkeley View: The bulk of the presentation is based on this paper. NVIDIA: Their GPUs and spec sheets provide some exciting numbers. Google: Image searching made easy. Tilera: Many-core CPU.

4 Overview: Exciting Pictures, Exciting Numbers; State of the Core; Why Many-Core?; Common Wisdoms Refuted; Dwarves and Applications; Programming Many-Core; Discussion

5 Many-Core CPU: Tilera Pictures

6 Many-Core CPU: Tilera Numbers Memory Installed 2.5GB DDR2 Processor Clock 700 MHz

7 Many-Core CPU: Tilera Numbers Not many numbers available yet... Memory Installed 2.5GB DDR2 Processor Clock 700 MHz

8 Many-Core GPU: Nvidia GTX 295 Pictures

9 Many-Core GPU: Nvidia GTX 295 Numbers GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec Memory Specs Memory Clock 999 MHz Memory Config 1792 MB GDDR3 Memory Interface Width 896-bit Memory Bandwidth 223.8 GB/s

10 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec Typical CPU has no more than 4 cores!

11 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 92.2 billion pixels per second...

12 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 92.2 billion pixels per second... A high-end monitor has a resolution of 2560 x 1600

13 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 92.2 billion pixels per second... A high-end monitor has a resolution of 2560 x 1600 That's...

14 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 92.2 billion pixels per second... A high-end monitor has a resolution of 2560 x 1600 That's about 4.1 million pixels

15 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 4.1 million << 92.2 billion You could re-draw an entire scene about 22,500 times per second!

16 Many-Core GPU: Nvidia GTX 295 GPU Engine Specs Cores 480 Graphics Clock 576 MHz Processor Clock 1242 MHz Texture Fill Rate 92.2 billion pixels/sec 4.1 million << 92.2 billion You could re-draw an entire scene about 22,500 times per second! (assumes trivial, flat scene)
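
The redraw estimate on these slides comes from dividing the quoted fill rate by the pixel count of one frame. A minimal sketch of that arithmetic follows; the variable names are chosen here for illustration, and it keeps the slide's assumption of a trivial, flat scene in which every pixel is filled exactly once.

```python
# Back-of-the-envelope check of the redraw estimate on the GTX 295 slides.
# Assumption (from the slide): a trivial flat scene, each pixel filled once.
FILL_RATE_PIXELS_PER_SEC = 92.2e9  # texture fill rate quoted on the slide
WIDTH, HEIGHT = 2560, 1600         # high-end monitor resolution from the slide

pixels_per_frame = WIDTH * HEIGHT
redraws_per_second = FILL_RATE_PIXELS_PER_SEC / pixels_per_frame

print(f"pixels per frame:   {pixels_per_frame:,}")
print(f"redraws per second: {redraws_per_second:,.0f}")
```

The result lands at roughly 22,500 redraws per second, matching the figure on the slide.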

17 Many-Core GPU: Nvidia GTX 295 Numbers 223.8 GB/s Memory Specs Memory Clock 999 MHz Memory Config 1792 MB GDDR3 Memory Interface Width 896-bit Memory Bandwidth 223.8 GB/s

18 Many-Core GPU: Nvidia GTX 295 Numbers 223.8 GB/s ~1TB / 4 seconds Memory Specs Memory Clock 999 MHz Memory Config 1792 MB GDDR3 Memory Interface Width 896-bit Memory Bandwidth 223.8 GB/s

19 Many-Core GPU: Nvidia GTX 295 Numbers 223.8 GB/s ~1TB / 4 seconds Let's picture it. Memory Specs Memory Clock 999 MHz Memory Config 1792 MB GDDR3 Memory Interface Width 896-bit Memory Bandwidth 223.8 GB/s

20 Many-Core GPU: Nvidia GTX 295 Numbers WARNING The following slides feature assumptions that have no basis in reality. *A disk cannot store data at 223.8GB/s (yet)

21 Many-Core GPU: Nvidia GTX 295 4 seconds of data processing ~1 TB

22 Many-Core GPU: Nvidia GTX 295 1 minute of data processing

23 Many-Core GPU: Nvidia GTX 295 1 hour of data processing 3.5" form factor = 4 x 5.75 x 1 = 23 in³. 15 x 60 = 900 disks filled in one hour. 23 in³ x 15 x 60 = 20,700 in³, or about 12 ft³ (1 ft³ = 1,728 in³). Stacked flat, those 900 one-inch-thick disks would form a column 75 ft tall, well short of the 1,730 ft Sears (Willis) Tower.
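
The data volumes on the last few slides can be reproduced with a few lines of arithmetic. The sketch below assumes a hypothetical 1 TB (1,000 GB) drive and uses the 223.8 GB/s bandwidth quoted earlier; the function and constant names are illustrative only. The slides round 4 seconds of traffic up to about 1 TB, which gives 15 disks per minute and 900 per hour, while the unrounded figure is somewhat lower.

```python
# Data volume at the GTX 295's quoted memory bandwidth. As the slides warn,
# this ignores the fact that no disk can actually absorb data this fast.
BANDWIDTH_GB_PER_SEC = 223.8          # memory bandwidth quoted on the slides
DISK_CAPACITY_GB = 1000.0             # assumption: a hypothetical 1 TB drive
DISK_VOLUME_CUBIC_IN = 4 * 5.75 * 1   # 3.5" form factor dimensions from the slide

def gigabytes_moved(seconds: float) -> float:
    """Gigabytes transferred in `seconds` at the quoted bandwidth."""
    return BANDWIDTH_GB_PER_SEC * seconds

for label, seconds in [("4 seconds", 4), ("1 minute", 60), ("1 hour", 3600)]:
    gb = gigabytes_moved(seconds)
    disks = gb / DISK_CAPACITY_GB
    print(f"{label}: {gb:,.0f} GB, about {disks:,.1f} one-terabyte disks")

print(f"volume of one disk: {DISK_VOLUME_CUBIC_IN} cubic inches")
```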

24 State of the Core Where Are We Now? Sequential processors aren't getting any faster: no free lunch via Moore's Law voucher. Quad-core = commodity. Parallel applications are few and far between in consumer markets: look to scientific and enterprise computing, or the multi-process Google Chrome browser. Vendors want more performance and they want it yesterday; it is often a selling point.

25 State of the Core Where Are We Now? Graphics processors no longer take the back seat: GPGPU (2002), CUDA v1.1 (2007), CUDA v3.0b (2010). Powerful accelerators mingling with CPU: IBM Cell.

26 State of the Core Where Are We Now? Biggest problem: How do we develop efficient, correct, scalable parallel components? How do we develop highly-parallel applications composed of those components? Components: data structures, algorithms.

27 Why Many-Core? Beyond Exciting Numbers It's not a victory parade towards a bright, new idea. It's a retreat from an even greater challenge. We can't make sequential processors faster without melting them (or exploding our energy bills). We still want to get faster as quickly as possible, so we pursue the most immediate solution towards that end. As a result, many of our conventional wisdoms acquired over previous decades of computing have been overturned.

28 Common Wisdoms (Refuted) Power vs. Transistors Old Wisdom: Power is free, but transistors are expensive. New Wisdom: Power is expensive, but transistors are free. We can fit as many transistors on a chip as we have power to turn them on!

29 Common Wisdoms (Refuted) Dynamic vs. Static Power Old Wisdom: You should only worry about dynamic power (voltage scaling). New Wisdom: For desktops and servers, static power leakage can be 40% of total power.

30 Common Wisdoms (Refuted) Hardware Errors Old Wisdom: Uniprocessors are reliable internally. New Wisdom: With transistor designs falling below 65nm scale, errors occur at quantum level.

31 Common Wisdoms (Refuted) Scaling Designs Old Wisdom: Old successes guide future successes, so we need only build upon prior designs. New Wisdom: As design size (nanometer) drops, a multitude of factors will stretch development time.

32 Common Wisdoms (Refuted) Architecture Research Old Wisdom: Let academia evaluate experimental designs; they can build their own chips. New Wisdom: Academia can no longer afford the tools required to build believable chips.

33 Common Wisdoms (Refuted) Bandwidth vs. Latency Old Wisdom: Performance improvements mean latency drops and bandwidth increases. New Wisdom: Bandwidth improves exponentially compared to latency (memory wall).

34 Common Wisdoms (Refuted) Computation vs. Memory Access Old Wisdom: Store common computations in tables; arithmetic is slow. New Wisdom: Re-compute needed results; data storage is slow.

35 Common Wisdoms (Refuted) Instruction Level Parallelism Old Wisdom: There's an abundance of ILP waiting to be found by compilers, architecture designs, VLIW... New Wisdom: Diminishing returns on ILP.

36 Common Wisdoms (Refuted) Moore's Law Old Wisdom: Performance doubles every 18 months on a uniprocessor. New Wisdom: Power Wall + Memory Wall + ILP Wall: a uniprocessor now needs more than 60 months per doubling.

37 Common Wisdoms (Refuted) Why Parallelize? Old Wisdom: Don't bother parallelizing; Moore's Law promises it'll run faster in a couple of years, unmodified. New Wisdom: It'll be a long time before an unmodified program gets faster.

38 Common Wisdoms (Refuted) Parallel Performance Value Old Wisdom: If it doesn't scale linearly, trash it. New Wisdom: Any performance scaling is better than none; it's the only way to get faster now!

39 Common Wisdoms Moore's Law II New Wisdom: The number of cores available on a chip doubles every 18 months.
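
To get a feel for what an 18-month doubling period implies, here is a small projection sketch. The function name and the quad-core starting point are illustrative assumptions; the formula is plain exponential growth and nothing more.

```python
def projected_cores(initial_cores: int, months: float,
                    doubling_months: float = 18.0) -> float:
    """Core count after `months`, assuming one doubling every `doubling_months`."""
    return initial_cores * 2 ** (months / doubling_months)

# Starting from a commodity quad-core chip (an assumption for illustration).
for years in (0, 3, 6, 9):
    print(f"{years} years: {round(projected_cores(4, years * 12))} cores")
```

Under that assumption a quad-core chip would grow to 16 cores in three years and 256 in nine.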

40 Classifying Parallelism Dwarves To better understand parallel applications, a series of areas where parallelism is commonly exploited was analyzed. These areas are called dwarves: patterns of communication and computation that each identify a category of application.

41 Classifying Parallelism The Seven Dwarves

42 The Seven Dwarves: Dense Linear Algebra (BLAS); Sparse Linear Algebra (conjugate gradient); Spectral Methods (FFT, DSP); N-Body simulation; Structured Grid (PDE solver); Unstructured Grid; Monte Carlo (embarrassingly parallel. Why? Independent events).
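
Monte Carlo methods are embarrassingly parallel because every sample is an independent event: tasks share no state and only combine their results at the end. The sketch below estimates pi that way. It is an illustration rather than a tuned implementation; the function names are made up here, and it uses threads from Python's standard library purely to keep the example self-contained (for real CPU-bound speedup in CPython one would typically use separate processes).

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def count_hits(seed: int, samples: int) -> int:
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # private generator: no state shared between tasks
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(total_samples: int, workers: int = 4) -> float:
    """Split the samples across independent tasks and combine the hit counts."""
    per_task = total_samples // workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Tasks never communicate; results are only summed at the end.
        hits = sum(pool.map(count_hits, range(workers), [per_task] * workers))
    return 4.0 * hits / (per_task * workers)

print(f"estimate: {estimate_pi(200_000):.4f}  (math.pi = {math.pi:.4f})")
```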

43 The Six (Other) Dwarves Why more dwarves? Up and coming algorithmic techniques and application domains require parallelization. Observed domains: Combinatorial logic (SHA, MD5, AES, cryptography); Graph traversal (BFS, A*, maximum network flow); Finite state machines; Bayesian networks, Hidden Markov Models; Machine learning; Dynamic programming; Backtrack, branch-and-bound.

44 The Six (Other) Dwarves In particular, no approach is currently known for parallel evaluation of a finite state machine. How can a system be in multiple states at once? Thought to be embarrassingly sequential. This brings us to our crux: how do we develop new parallel applications?
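
The reason a finite state machine looks embarrassingly sequential is visible in its evaluation loop: each step needs the state produced by the step before it. The sketch below uses a hypothetical two-state machine (tracking whether an even or odd number of 1s has been seen) only to make that loop-carried dependency concrete.

```python
# Hypothetical two-state machine, used only for illustration:
# it tracks whether an even or odd number of '1' symbols has been read.
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}

def run_fsm(symbols: str, start: str = "even") -> str:
    """Evaluate the machine one symbol at a time and return the final state."""
    state = start
    for symbol in symbols:
        # Loop-carried dependency: this step cannot begin until the
        # previous step has produced `state`.
        state = TRANSITIONS[(state, symbol)]
    return state

print(run_fsm("1101"))  # three 1s, so the machine ends in "odd"
```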

45 Programming Many-Core We have exciting equipment: GPUs, CPUs, accelerators... We have clear application areas: linear algebra, graph traversal, dynamic programming... How do we use those exciting machines to satisfy those application requirements?

46 Programming Many-Core #include <???> import??? from

47 Programming Many-Core #include <???> import??? from How did I get here...?

48 Programming Many-Core #include <???> import??? from and who is that behind me?

49 Parallel Programming Models Producing an efficient parallel application requires managing many more details than producing a sequential one. Programming models seek to simplify the management of one or more of the following: task identification, task mapping, data distribution, communication mapping, synchronization.
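
To make those five concerns concrete, here is a small threaded sum in which each one is handled by hand and marked in a comment. It is a sketch using Python's standard threading module; the function name and the chunking scheme are assumptions chosen for illustration, not a recommended design.

```python
import threading

def parallel_sum(values: list[int], workers: int = 4) -> int:
    """Sum `values` using one thread per chunk, with every concern made explicit."""
    # Task identification: summing one contiguous chunk is one task.
    chunk = max(1, (len(values) + workers - 1) // workers)
    # Data distribution: each task receives its own slice of the input.
    slices = [values[i:i + chunk] for i in range(0, len(values), chunk)]

    total = 0
    lock = threading.Lock()

    def work(part: list[int]) -> None:
        nonlocal total
        partial = sum(part)
        # Communication mapping: partial results flow back via a shared variable.
        # Synchronization: the lock keeps concurrent updates from interleaving.
        with lock:
            total += partial

    # Task mapping: one thread per task.
    threads = [threading.Thread(target=work, args=(part,)) for part in slices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # Synchronization: wait for every task before reading the result.
    return total

print(parallel_sum(list(range(1, 101))))  # 5050
```

A programming model earns its keep by taking one or more of these commented steps off the programmer's hands.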

50 Parallel Programming Models A Few Existing Examples: MPI, Pthreads, MapReduce, OpenMP, CUDA, OpenCL

51 Parallel Programming Models A Few Existing Examples

Model      Task ID   Task Mapping  Data Distrib.  Comm. Mapping  Sync
MPI        explicit  explicit      explicit       implicit       implicit
Pthreads   explicit  explicit      implicit       implicit       explicit
MapReduce  explicit  implicit      implicit       implicit       explicit
OpenMP     implicit  implicit      implicit       implicit       implicit
CUDA       explicit  explicit      explicit       implicit       explicit
OpenCL     N/A       N/A           N/A            N/A            N/A

52 Parallel Programming Models Message passing. Pros: harder to make mistakes; much upfront planning; highly verifiable model. Cons: difficult to learn; hardware not yet widely thought of as networked components. Shared memory. Pros: friendly learning curve; matches hardware layout. Cons: may not scale; cache coherence...; easy to make mistakes.
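
The trade-off between the two models shows up even in a toy counting problem. In the sketch below, the shared-memory version depends on a lock that is easy to forget, while the message-passing version has no shared mutable state but requires deciding up front who sends what to whom. Threads and a queue from Python's standard library stand in for both models here; this is an illustrative assumption, since real message passing (MPI, for example) would normally run across separate processes or machines.

```python
import queue
import threading

def shared_memory_count(increments: int, workers: int = 4) -> int:
    """Every worker updates one shared counter, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def work() -> None:
        nonlocal counter
        for _ in range(increments):
            with lock:  # omitting this lock is the classic shared-memory mistake
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

def message_passing_count(increments: int, workers: int = 4) -> int:
    """Every worker keeps a private count and sends it as a single message."""
    inbox = queue.Queue()

    def work() -> None:
        local = 0
        for _ in range(increments):
            local += 1      # private state: nothing to protect
        inbox.put(local)    # the only interaction is an explicit message

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in range(workers))

print(shared_memory_count(1000), message_passing_count(1000))
```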

53 Looking to the Many-Core Future We need: better parallel programming models; better compilers; a set of primitives to build APIs from; thoroughly tested parallel components; libraries; tools.

54 Looking to the Many-Core Future Many challenges await. However, thanks to the revised Moore's Law, we have much to look forward to. Now, more than ever before, we'll be able to make great advances in sciences depending on extensive computation. All we have to do is add another core.

55

56 We'll yet get to the core of this issue.

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Multimedia in Mobile Phones. Architectures and Trends Lund

Multimedia in Mobile Phones. Architectures and Trends Lund Multimedia in Mobile Phones Architectures and Trends Lund 091124 Presentation Henrik Ohlsson Contact: henrik.h.ohlsson@stericsson.com Working with multimedia hardware (graphics and displays) at ST- Ericsson

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

GPGPU on ARM. Tom Gall, Gil Pitney, 30 th Oct 2013

GPGPU on ARM. Tom Gall, Gil Pitney, 30 th Oct 2013 GPGPU on ARM Tom Gall, Gil Pitney, 30 th Oct 2013 Session Description This session will discuss the current state of the art of GPGPU technologies on ARM SoC systems. What standards are there? Where are

More information

Carlo Cavazzoni, HPC department, CINECA

Carlo Cavazzoni, HPC department, CINECA Introduction to Shared memory architectures Carlo Cavazzoni, HPC department, CINECA Modern Parallel Architectures Two basic architectural scheme: Distributed Memory Shared Memory Now most computers have

More information

Administration. Coursework. Prerequisites. CS 378: Programming for Performance. 4 or 5 programming projects

Administration. Coursework. Prerequisites. CS 378: Programming for Performance. 4 or 5 programming projects CS 378: Programming for Performance Administration Instructors: Keshav Pingali (Professor, CS department & ICES) 4.126 ACES Email: pingali@cs.utexas.edu TA: Hao Wu (Grad student, CS department) Email:

More information

PFAC Library: GPU-Based String Matching Algorithm

PFAC Library: GPU-Based String Matching Algorithm PFAC Library: GPU-Based String Matching Algorithm Cheng-Hung Lin Lung-Sheng Chien Chen-Hsiung Liu Shih-Chieh Chang Wing-Kai Hon National Taiwan Normal University, Taipei, Taiwan National Tsing-Hua University,

More information

Administration. Prerequisites. Website. CSE 392/CS 378: High-performance Computing: Principles and Practice

Administration. Prerequisites. Website. CSE 392/CS 378: High-performance Computing: Principles and Practice CSE 392/CS 378: High-performance Computing: Principles and Practice Administration Professors: Keshav Pingali 4.126 ACES Email: pingali@cs.utexas.edu Jim Browne Email: browne@cs.utexas.edu Robert van de

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Part IV. Review of hardware-trends for real-time ray tracing

Part IV. Review of hardware-trends for real-time ray tracing Part IV Review of hardware-trends for real-time ray tracing Hardware Trends For Real-time Ray Tracing Philipp Slusallek Saarland University, Germany Large Model Visualization at Boeing CATIA Model of Boeing

More information

Chapter 1: Introduction to Parallel Computing

Chapter 1: Introduction to Parallel Computing Parallel and Distributed Computing Chapter 1: Introduction to Parallel Computing Jun Zhang Laboratory for High Performance Computing & Computer Simulation Department of Computer Science University of Kentucky

More information

The Future of GPU Computing

The Future of GPU Computing The Future of GPU Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University November 18, 2009 The Future of Computing Bill Dally Chief Scientist

More information

Evaluation Of The Performance Of GPU Global Memory Coalescing

Evaluation Of The Performance Of GPU Global Memory Coalescing Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

How to Optimize Geometric Multigrid Methods on GPUs

How to Optimize Geometric Multigrid Methods on GPUs How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient

More information

Lect. 2: Types of Parallelism

Lect. 2: Types of Parallelism Lect. 2: Types of Parallelism Parallelism in Hardware (Uniprocessor) Parallelism in a Uniprocessor Pipelining Superscalar, VLIW etc. SIMD instructions, Vector processors, GPUs Multiprocessor Symmetric

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2

Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2 Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era 11/16/2011 Many-Core Computing 2 Gene M. Amdahl, Validity of the Single-Processor Approach to Achieving

More information

Using CUDA to Accelerate Radar Image Processing

Using CUDA to Accelerate Radar Image Processing Using CUDA to Accelerate Radar Image Processing Aaron Rogan Richard Carande 9/23/2010 Approved for Public Release by the Air Force on 14 Sep 2010, Document Number 88 ABW-10-5006 Company Overview Neva Ridge

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

Direct Rendering of Trimmed NURBS Surfaces

Direct Rendering of Trimmed NURBS Surfaces Direct Rendering of Trimmed NURBS Surfaces Hardware Graphics Pipeline 2/ 81 Hardware Graphics Pipeline GPU Video Memory CPU Vertex Processor Raster Unit Fragment Processor Render Target Screen Extended

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

GPU Architecture Overview

GPU Architecture Overview Course Overview Follow-Up GPU Architecture Overview Patrick Cozzi University of Pennsylvania CIS 565 - Fall 2012 Read before or after lecture? Project vs. final project Closed vs. open source Feedback

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group

How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group avi.mendelson@intel.com 1 Disclaimer No Intel proprietary information is disclosed. Every future estimate

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally

More information

Big Orange Bramble. August 09, 2016

Big Orange Bramble. August 09, 2016 Big Orange Bramble August 09, 2016 Overview HPL SPH PiBrot Numeric Integration Parallel Pi Monte Carlo FDS DANNA HPL High Performance Linpack is a benchmark for clusters Created here at the University

More information

Accelerating Financial Applications on the GPU

Accelerating Financial Applications on the GPU Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General

More information

Twos Complement Signed Numbers. IT 3123 Hardware and Software Concepts. Reminder: Moore s Law. The Need for Speed. Parallelism.

Twos Complement Signed Numbers. IT 3123 Hardware and Software Concepts. Reminder: Moore s Law. The Need for Speed. Parallelism. Twos Complement Signed Numbers IT 3123 Hardware and Software Concepts Modern Computer Implementations April 26 Notice: This session is being recorded. Copyright 2009 by Bob Brown http://xkcd.com/571/ Reminder:

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

CS 6534: Tech Trends / Intro

CS 6534: Tech Trends / Intro 1 CS 6534: Tech Trends / Intro Charles Reiss 24 August 2016 Moore s Law Microprocessor Transistor Counts 1971-2011 & Moore's Law 16-Core SPARC T3 2,600,000,000 1,000,000,000 Six-Core Core i7 Six-Core Xeon

More information

CMPE 665:Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN

CMPE 665:Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN CMPE 665:Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN Graphics Processing Unit Accelerate the creation of images in a frame buffer intended for the output

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Advances of parallel computing. Kirill Bogachev May 2016

Advances of parallel computing. Kirill Bogachev May 2016 Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being

More information

Technology in Action. Chapter Topics. Participation Question. Participation Question. Participation Question 8/8/11

Technology in Action. Chapter Topics. Participation Question. Participation Question. Participation Question 8/8/11 Technology in Action Chapter 6 Understanding and Assessing Hardware: Evaluating Your System 1 Chapter Topics To buy or to upgrade? Evaluating your system CPU RAM Storage devices Video card Sound card System

More information

EE , GPU Programming

EE , GPU Programming EE 4702-1, GPU Programming When / Where Here (1218 Patrick F. Taylor Hall), MWF 11:30-12:20 Fall 2017 http://www.ece.lsu.edu/koppel/gpup/ Offered By David M. Koppelman Room 3316R Patrick F. Taylor Hall

More information

CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce

More information

Scalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009

Scalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation

More information

Graphics Hardware 2008

Graphics Hardware 2008 AMD Smarter Choice Graphics Hardware 2008 Mike Mantor AMD Fellow Architect michael.mantor@amd.com GPUs vs. Multi-core CPUs On a Converging Course or Fundamentally Different? Many Cores Disruptive Change

More information

Partial Wave Analysis using Graphics Cards

Partial Wave Analysis using Graphics Cards Partial Wave Analysis using Graphics Cards Niklaus Berger IHEP Beijing Hadron 2011, München The (computational) problem with partial wave analysis n rec * * i=1 * 1 Ngen MC NMC * i=1 A complex calculation

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

Realtime Water Simulation on GPU. Nuttapong Chentanez NVIDIA Research

Realtime Water Simulation on GPU. Nuttapong Chentanez NVIDIA Research 1 Realtime Water Simulation on GPU Nuttapong Chentanez NVIDIA Research 2 3 Overview Approaches to realtime water simulation Hybrid shallow water solver + particles Hybrid 3D tall cell water solver + particles

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications

More information

AMD s Unified CPU & GPU Processor Concept

AMD s Unified CPU & GPU Processor Concept Advanced Seminar Computer Engineering Institute of Computer Engineering (ZITI) University of Heidelberg February 5, 2014 Overview 1 2 Current Platforms: 3 4 5 Architecture 6 2/37 Single-thread Performance

More information

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh. Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the

More information

Agenda. What is Ryzen? History. Features. Zen Architecture. SenseMI Technology. Master Software. Benchmarks

Agenda. What is Ryzen? History. Features. Zen Architecture. SenseMI Technology. Master Software. Benchmarks Ryzen Agenda What is Ryzen? History Features Zen Architecture SenseMI Technology Master Software Benchmarks The Ryzen Chip What is Ryzen? CPU chip family released by AMD in 2017, which uses their latest

More information

ECE 486/586. Computer Architecture. Lecture # 2

ECE 486/586. Computer Architecture. Lecture # 2 ECE 486/586 Computer Architecture Lecture # 2 Spring 2015 Portland State University Recap of Last Lecture Old view of computer architecture: Instruction Set Architecture (ISA) design Real computer architecture:

More information

Lecture 1: Introduction and Basics

Lecture 1: Introduction and Basics CS 515 Programming Language and Compilers I Lecture 1: Introduction and Basics Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/5/2017 Class Information Instructor: Zheng (Eddy) Zhang Email: eddyzhengzhang@gmailcom

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Παράλληλη Επεξεργασία

Παράλληλη Επεξεργασία Παράλληλη Επεξεργασία Μέτρηση και σύγκριση Παράλληλης Απόδοσης Γιάννος Σαζεϊδης Εαρινό Εξάμηνο 2013 HW 1. Homework #3 due on cuda (summary of Tesla paper on web page) Slides based on Lin and Snyder textbook

More information
