The End of Denial Architecture and The Rise of Throughput Computing
1 The End of Denial Architecture and The Rise of Throughput Computing. Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA; Bell Professor of Engineering, Stanford University. May 19, 2009, Async Conference, UNC.
2 Outline: Performance = Parallelism; Efficiency = Locality; single-thread processors are in denial about these two facts; we are entering the era of Throughput Computing. Stream Processors (parallel arithmetic units, exposed storage hierarchy); Stream Programming (bulk operations); GPU Computing.
3 Moore's Law: In 1965 Gordon Moore predicted the number of transistors on an integrated circuit would double every year (later revised to 18 months). He also predicted L³ power scaling for constant function, but made no prediction of processor performance. Advances in architecture turn device scaling into performance scaling. Moore, Electronics 38(8), April 19, 1965.
4 The Computer Value Chain: Moore's law gives us more transistors; architects turn these transistors into more performance; applications turn this performance into value for the user.
5 Discontinuity 1: The End of ILP Scaling. [Chart: performance (ps/inst) vs. calendar year on a log scale, 1e-0 to 1e+7, flattening to a 19%/year trend.] Dally et al., The Last Classical Computer, ISAT Study, 2001.
6 Future potential of novel architecture is large (1,000:1 vs 30:1). [Chart: extrapolations of performance (ps/inst) on a log scale, with trend lines marked 30:1, 1,000:1, and 30,000:1 against the 19%/year line.] Dally et al., The Last Classical Computer, ISAT Study, 2001.
7 Single-Thread Processor Performance vs Calendar Year Source: Hennessy & Patterson, CAAQA, 4th Edition
8 Technology Constraints
9 A CMOS Chip is our Canvas: a 20mm × 20mm die.
10 4,000 64b FPUs fit on a chip: a 64b FPU occupies 0.1 mm² and costs 50 pJ/op at 1.5 GHz on a 20mm die.
11 200,000 16b MACs fit on a chip: a 16b MAC occupies 0.002 mm² and costs 1 pJ/op at 1.5 GHz (vs. 0.1 mm² and 50 pJ/op for the 64b FPU).
12 Moving a word across the die costs the energy of 124 MACs or 10 FMAs; moving a word off chip costs 250 MACs or 20 FMAs. On-chip channels: 16b, 6 pJ/word per mm (62 pJ, 4 cycles per 10 mm); 64b, 25 pJ/word per mm (250 pJ, 4 cycles per 10 mm). Off-chip channels: 16b, 250 pJ/word; 64b, 1 nJ/word. (16b fixed point vs. 64b floating point on a 20mm die.)
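The slide's energy accounting can be reproduced as arithmetic. This is a sketch: all pJ figures are the slide's 2009-era numbers for a 20 mm die, and the 16b channel cost is taken as 6.2 pJ/mm (62 pJ per 10 mm on the slide).

```python
DIE_MM = 20
MAC16_PJ = 1.0            # 16b multiply-accumulate
FMA64_PJ = 50.0           # 64b floating-point op
CH16_PJ_PER_MM = 6.2      # 16b on-chip channel, per word per mm
CH64_PJ_PER_MM = 25.0     # 64b on-chip channel, per word per mm
OFF16_PJ = 250.0          # 16b off-chip transfer, per word
OFF64_PJ = 1000.0         # 64b off-chip transfer, per word

def ops_per_word_across_die():
    # how many ops cost the same energy as moving one word across the die
    macs = DIE_MM * CH16_PJ_PER_MM / MAC16_PJ     # ~124 MACs
    fmas = DIE_MM * CH64_PJ_PER_MM / FMA64_PJ     # 10 FMAs
    return macs, fmas

def ops_per_word_off_chip():
    # and how many for one word off chip
    return OFF16_PJ / MAC16_PJ, OFF64_PJ / FMA64_PJ   # 250 MACs, 20 FMAs
```

The ratios, not the absolute numbers, are the point: communication, not arithmetic, dominates the energy budget.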
13 Discontinuity 2: The End of L³ Power Scaling. Gordon Moore, ISSCC 2003.
14 Performance = Parallelism Efficiency = Locality
15 The End of Denial Architecture: single-thread processors are in denial about parallelism and locality. They provide two illusions: serial execution, which denies parallelism (they try to exploit parallelism with ILP, which has limited scalability); and flat memory, which denies locality (they try to provide the illusion with caches, which are very inefficient when the working set doesn't fit in the cache).
16 We are entering the era of Throughput Computing
17 Throughput computing is what matters now. Latency-optimized processors (most CPUs) are improving very slowly; little value is being delivered from the evolution of these processors. Throughput-optimized processors (like GPUs) are still improving at >70% per year, which drives new throughput applications that convert this performance to value. Going forward, throughput processors matter, not latency processors.
18 Applications
19 Scientific Applications: large data sets and lots of parallelism. Increasingly irregular (AMR), with irregular and dynamic data structures that require efficient gather/scatter. Increasingly complex models. Lots of locality, though the global solution is sometimes bandwidth-limited, with less locality in those phases.
20 Embedded Applications: codecs, modems, image processing, etc. Lots of data (pixels, samples, etc.) and lots of parallelism. Increasingly complex, with lots of data-dependent control. High arithmetic intensity and lots of locality.
21 Performance = Parallelism; Efficiency = Locality. Fortunately, most applications have lots of both. Amdahl's law doesn't apply to most future applications.
22 Stream Processor Architecture
23 Organize computation to optimize use of scarce bandwidth: minimize expensive data movement, keep scarce bandwidth resources busy, take advantage of plentiful arithmetic, and operate in parallel when data is local. Avoid denial architecture: don't hide parallelism or locality; a flat, serial model inhibits optimization.
24 Optimize use of scarce bandwidth: provide a rich, exposed storage hierarchy. [Diagram: Global Memory → switch → LM and CM → switches → RMs → switches → registers (R) → arithmetic units (A).]
25 Optimize use of scarce bandwidth: provide a rich, exposed storage hierarchy and explicitly manage data movement on this hierarchy, which reduces demand and increases utilization. [Dataflow diagram: kernels Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, and Advance Cell, connected by streams of element faces, face geometry, numerical flux, cell geometry, gathered elements, cell orientations, and current/new elements, plus read-only table lookup data (master element).]
26 Optimize use of scarce bandwidth: provide a rich, exposed storage hierarchy and explicitly manage data movement. This makes execution predictable and enables more arithmetic per unit bandwidth.
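The explicit-management idea on these slides can be sketched in a few lines: software stages a tile into a small local store, computes entirely out of it, and writes it back, so every DRAM transfer is a deliberate bulk operation that is trivial to count. The `Scratchpad` class and `TILE` size are illustrative names, not from the talk.

```python
TILE = 4

class Scratchpad:
    """A software-managed local store with explicit bulk transfers."""
    def __init__(self):
        self.buf = [0] * TILE
        self.dram_words_moved = 0   # explicit transfers are easy to count

    def load(self, dram, base):
        self.buf[:] = dram[base:base + TILE]
        self.dram_words_moved += TILE

    def store(self, dram, base):
        dram[base:base + TILE] = self.buf
        self.dram_words_moved += TILE

def scale_in_place(dram, k):
    """Multiply every element of 'dram' by k, tile by tile."""
    sp = Scratchpad()
    for base in range(0, len(dram), TILE):
        sp.load(dram, base)                  # bulk transfer in
        sp.buf = [k * x for x in sp.buf]     # all compute hits the local store
        sp.store(dram, base)                 # bulk transfer out
    return sp.dram_words_moved
```

Because every word moved is an explicit decision, demand on memory bandwidth is both minimized and predictable, which is the property the slide is after.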
27 Summary so far: arithmetic is cheap, bandwidth is expensive. Performance = parallelism; efficiency = locality; and most applications have lots of both. Optimize use of scarce bandwidth with a rich, exposed storage hierarchy and explicit, bulk transfers: this reduces demand, increases utilization, and enables more arithmetic and predictable execution.
28 What is a Stream Processor? Lots of arithmetic units, to exploit parallelism. A rich, exposed storage hierarchy, which makes frequent communication inexpensive and kernel execution predictable. Explicit bulk transfers (streams) and bulk operations (kernels): a load/store machine for streams. Simple control (SIMD × VLIW × threads) and simple synchronization yield low overhead. [Diagram: lanes 0-15, each with operand register files (ORFs), ALU0-ALU4, a COM unit, load/store units, a local switch, and a 16-KByte lane register file, connected by an inter-lane switch to a stream load/store unit and external DRAM.]
29 4/6/09 Some Example Stream Processors
30 Example Stream Processors: Imagine, Cell, Merrimac (12.5 mm die), NVIDIA GT200, and Storm-1. [Die photos and floorplans; the Imagine floorplan (10.2 mm) shows 16 clusters, a microcontroller, cache banks, 16 RDRAM interfaces with forward ECC, MIPS64 20kc host cores, a memory switch, address generators, and reorder buffers.]
31 Imagine: 48 32b FPUs (8 lanes of 6 32b FPUs each). Exposed storage hierarchy: LRFs, SRF, DRAM. Bulk operations, including gather/scatter and conditional streams. Applications: graphics (OpenGL, RTSL, Reyes, ray tracing), image processing, codecs, modems, network processing. 20× the performance/W of efficient DSPs in the same technology; parity with GPUs in the same technology.
32 Imagine Architecture: a host processor drives the Imagine processor's stream controller and microcontroller; SRF banks with ALU clusters connect to SDRAM memory systems and, via a network interface, to the Imagine network. [Block diagram.]
33 Imagine Architecture (cont.): each ALU cluster contains three adders (ADD), two multipliers (MUL), a divide/square-root unit (DSQ), a scratchpad (SP), and a COMM unit to/from other clusters, connected by an intra-cluster switch to/from the SRF. [Block diagram, expanding one cluster of the previous slide.]
34 Producer-Consumer Locality in the Depth Extractor: rows of pixels flow from memory through the SRF (streams) to the clusters (kernels) — Gaussian convolution, Laplacian convolution, then SAD — with partial sums, blurred and sharpened rows, and filtered row segments passed producer-to-consumer without returning to memory. Bandwidth ratio, memory : SRF : LRF = 1 : 23 : 317.
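The chaining pattern on this slide can be sketched as kernels composed row by row, with intermediates living in local storage rather than going back to memory. The two filters below are simple stand-ins for the depth extractor's Gaussian/Laplacian kernels (illustrative coefficients, not the real ones).

```python
def blur(row):
    # 2-tap average, edge sample repeated (Gaussian stand-in)
    n = len(row)
    return [(row[i] + row[min(i + 1, n - 1)]) / 2.0 for i in range(n)]

def diff(row):
    # 2-tap difference (Laplacian stand-in)
    n = len(row)
    return [row[min(i + 1, n - 1)] - row[i] for i in range(n)]

def depth_pipeline(rows):
    out = []
    for row in rows:        # stream of rows read from memory
        b = blur(row)       # intermediate: stays at the SRF/LRF level
        d = diff(b)         # consumed immediately by the next kernel
        out.append(d)       # only the final row returns to memory
    return out
```

Only the input and output rows ever touch "memory"; every intermediate is produced and consumed locally, which is exactly what the 1 : 23 : 317 bandwidth ratio reflects.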
35 Applications match the bandwidth hierarchy. [Chart: bandwidth (GB/s, log scale up to 1000) consumed at the LRF, SRF, and DRAM levels for DEPTH, MPEG, QRD, and RTSL, against machine peak.]
36 Merrimac: 64 64b FP MADD units (16 clusters of 4 units each), stream cache, shared memory, scatter-add. Bandwidth hierarchy: chip pins and router, 16 GB/s; stream cache banks, 64 GB/s; SRF lanes, 512 GB/s; LRFs, 3,840 GB/s. [Diagram: DRAM banks → cache banks → memory switch → SRF lanes → cluster switches → LRFs, with chip crossings via 10kχ and 1kχ switches and 100χ wires, and a network path.]
37 Merrimac Application Results [table; cells marked — were not captured in the transcription, and the speedup and efficiency vs. Pentium 4 columns are lost]:
Application | Developed in | Compiled by | Sustained GFLOP/s
StreamFEM | Brook | StreamC | 69
StreamMD | C | StreamC | —
StreamFLO | Brook | Metacompiled | 13
StreamCDP | Fortran | StreamC | 8
DGEMM | StreamC | StreamC | —
CONV2D | StreamC | StreamC | 79
FFT3D | StreamC | StreamC | —
StreamSPAS | C | StreamC | —
38 Merrimac Bandwidth Hierarchy [table; counts marked — were not captured in the transcription]:
Application | Sustained GFLOPS | FP Ops / Mem Ref | LRF Refs | SRF Refs | Mem Refs
StreamFEM3D (Euler, quadratic) | — | — | — (95.0%) | 6.3M (3.9%) | 1.8M (1.1%)
StreamFEM3D (MHD, constant) | — | — | — (99.4%) | 7.7M (0.4%) | 2.8M (0.2%)
GROMACS (StreamMD) | 50.1 * | — | 108M (95.0%) | 4.2M (2.9%) | 1.5M (1.3%)
StreamFLO | 12.9 * | 7.4 * | 234.3M (95.7%) | 7.2M (2.9%) | 3.4M (1.4%)
StreamSPAS | 6.2 (80% of BW) | — | — (57.6%) | 1.9M (28.5%) | 0.93M (13.9%)
Dense mat-mul | 116 (90% of peak FLOPS) | — | — (93.1%) | 17.1M (6.6%) | 0.6M (0.2%)
StreamCDP | — | — | — (58.9%) | 1.9M (25.8%) | 1.1M (15.3%)
* Dominated by divide and square-root operations.
39 A CUDA-Enabled GPU is the Ultimate Throughput Computer: GeForce GTX 280 / Tesla T10, 240 scalar processors (30 SMs × 8 each), >1 TFLOPS peak, exposed hierarchical memory, and a multiple of the performance and efficiency of a latency-optimized processor.
40 GPU Throughput Computer: communication fabric, memory & I/O, fixed-function acceleration. [Annotated die photo: GeForce GTX 280 / Tesla T10.]
41 Stream Programming
42 Stream Programming: Parallelism, Locality, and Predictability. Parallelism: data parallelism across stream elements, task parallelism across kernels, ILP within kernels. Locality: producer/consumer, and within kernels. Predictability: enables scheduling. [Diagram: kernel dependence graph over K1-K4.]
43 Evolution of Stream Programming. 1997, StreamC/KernelC: break programs into kernels and streams; kernels operate only on input/output streams and locals; communication scheduling and stream scheduling. 2002, Brook: continues the constructs of streams and kernels; hides underlying details; too one-dimensional. 2007, Sequoia: generalizes kernels to tasks; tasks operate on local data, gathered in an arbitrary way; inner tasks subdivide, leaf tasks compute; machine-specific details are factored out.
44 Stream Programming (Sequoia view): express computation as a tree of tasks. Inner tasks (streams) move data to subdivide the problem; leaf tasks (kernels) solve a local instance of the problem. This makes leaf tasks completely predictable, enabling static optimization and strategic optimization: stream scheduling and tuning of data movement. [Diagram: matmul_inner at node memory, subdividing across aggregate and local stores LS 0..LS 7, with matmul_leaf on each FU.]
45 Sequoia: Generalize Kernels into Leaf Tasks. Leaf tasks perform the actual computation (analogous to kernels) and have a small working set.

void task matmul::leaf( in float A[M][P], in float B[P][N], inout float C[M][N] )
{
  for (int i=0; i<M; i++)
    for (int j=0; j<N; j++)
      for (int k=0; k<P; k++)
        C[i][j] += A[i][k] * B[k][j];
}
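For reference, the leaf task above rendered as plain Python. In Sequoia the A, B, and C blocks are assumed resident in the local store; here they are ordinary lists of lists.

```python
def matmul_leaf(A, B, C):
    # Triple loop over a block whose whole working set is "local";
    # accumulates A*B into C, exactly as the Sequoia leaf does.
    M, P, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            for k in range(P):
                C[i][j] += A[i][k] * B[k][j]
```

The key property is that the leaf touches only its small, already-staged operands, so its running time is completely predictable.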
46 Inner tasks decompose to smaller subtasks, recursively, and have larger working sets.

void task matmul::inner( in float A[M][P], in float B[P][N], inout float C[M][N] )
{
  tunable unsigned int U, X, V;
  blkset Ablks = rchop(A, U, X);
  blkset Bblks = rchop(B, X, V);
  blkset Cblks = rchop(C, U, V);
  mappar (int i=0 to M/U, int j=0 to N/V)
    mapreduce (int k=0 to P/X)
      matmul(Ablks[i][k], Bblks[k][j], Cblks[i][j]);
}
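A Python sketch of the inner/leaf decomposition: an rchop-style 1-D blocking helper plus a mappar/mapreduce-shaped loop nest. U, X, and V play the role of Sequoia's tunables; the leaf is inlined here for brevity.

```python
def chop(n, step):
    # rchop-style blocking: [(0, step), (step, 2*step), ...]
    return [(i, min(i + step, n)) for i in range(0, n, step)]

def matmul_inner(A, B, C, U=1, X=1, V=1):
    M, P, N = len(A), len(B), len(B[0])
    for (i0, i1) in chop(M, U):              # mappar over blocks of C rows
        for (j0, j1) in chop(N, V):          # mappar over blocks of C cols
            for (k0, k1) in chop(P, X):      # mapreduce over the k dimension
                # leaf task, inlined: the block working set is "local"
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        for k in range(k0, k1):
                            C[i][j] += A[i][k] * B[k][j]
```

Because the result is the same for any blocking, U, X, and V can be tuned to match each level of a machine's storage hierarchy without touching the algorithm, which is Sequoia's central design choice.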
47 CUDA as a Stream Language: explicit control of the memory hierarchy with shared memory — __shared__ float a[SIZE]; — which also enables communication between the threads of a CTA. Transfer data up and down the hierarchy; operate on data with kernels. But CUDA does allow access to arbitrary data within a kernel.
48 Examples: application speedups of 146×, 36×, 19×, 17×, 100×, 149×, 47×, 20×, 24×, and 30×. [Montage of CUDA application results.]
49 Stream loads/stores (bulk operations) hide latency, with 1000s of words in flight. [Pipeline diagram: a gather → fn1 → fn2 → scatter stream program executing as LD Cells 0-3, Flux 0-2, Update 0-1, and ST Cells 0-1, overlapped across DRAM, the SRFs, and the LRFs.]
50 Explicit storage enables simple, efficient execution: all needed data and instructions are on-chip — no misses. [Same pipeline: LD Cells 0-3 overlapped with Flux 0-2, Update 0-1, and ST Cells 0-1.]
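The double-buffered schedule in the pipeline slides above can be sketched with a toy timing model: while the kernel computes on one buffer of cells, the stream unit loads the next, so all but the first load hides behind compute. The cycle counts below are illustrative, not from the talk.

```python
LOAD_CYCLES = 40      # time to stream one block of cells in
COMPUTE_CYCLES = 50   # time for the Flux/Update kernels on one block

def serial_time(blocks):
    # no overlap: every load is fully exposed
    return blocks * (LOAD_CYCLES + COMPUTE_CYCLES)

def double_buffered_time(blocks):
    # only the first load is exposed; each later load overlaps the
    # previous block's compute, so the longer phase sets the rate
    return LOAD_CYCLES + blocks * max(COMPUTE_CYCLES, LOAD_CYCLES)
```

Once the pipeline fills, throughput is set entirely by the slower of load and compute, which is why keeping thousands of words in flight makes memory latency irrelevant.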
51 Caches lack predictability (controlled via a "wet noodle"). [Pipeline comparison: with explicit storage, the LD/Flux/Update/ST schedule runs without stalls; with a cache, misses split Flux 0 and Update 0 into A/B halves and delay the whole schedule.]
52 Caches are controlled via a wet noodle: even at a 99% hit rate, one miss costs 100s of cycles — 10,000s of operations on a throughput machine. [Same pipeline comparison as the previous slide.]
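The slide's claim, worked as arithmetic: with a 99% hit rate and a miss penalty of a few hundred cycles, the average access cost is still several cycles, and each miss idles a wide throughput machine for tens of thousands of operations. The 300-cycle penalty and 100-ops/cycle width below are illustrative orders of magnitude matching the slide's "100s of cycles" and "10,000s of ops".

```python
def avg_access_cycles(hit_rate, hit_cycles=1, miss_cycles=300):
    # expected cost of one memory access under a simple two-level model
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

def ops_lost_per_miss(miss_cycles=300, ops_per_cycle=100):
    # work a wide throughput machine forgoes while stalled on one miss
    return miss_cycles * ops_per_cycle
```

Even a "good" cache therefore quadruples the average access cost, and, worse, does so unpredictably, which is what defeats static scheduling.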
53 Predictability enables efficient static scheduling. [Software-pipelined schedule of one iteration of the ComputeCellInt kernel from StreamFEM3D.] Over 95% of peak with simple hardware; depends on explicit communication to make delays predictable.
54 Bulk operations enable strategic optimization: the StreamFEM application. [Dataflow diagram, as on slide 25: kernels Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, and Advance Cell, connected by streams of element faces, face geometry, numerical flux, cell geometry, gathered elements, cell orientations, and current/new elements, plus read-only table lookup data (master element).]
55 Results: horizontal portability across Scalar, SMP, Disk, Cluster, Cell, and PS3 targets for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D*, GRAVITY, and HMMER*. (* Dataset size reduced to fit in memory.)
56 System Utilization. [Chart: percentage of runtime, 0-100, for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on SMP, Disk, Cluster, Cell, and PS3.]
57 Resource Utilization: IBM Cell. [Chart: bandwidth utilization vs. compute utilization, 0-100%.]
58 Search Spaces. T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle. Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation. In IEEE PACT, 2000.
59 Performance of Auto-tuner: measured raw performance (GFLOPS) of the auto-tuned vs. hand-tuned versions of Conv2D, SGEMM, FFT3D, and SUmb on Cell, Cluster, and Cluster of PS3s. [Table values not captured in the transcription.] For FFT3D, performance is with fusion of leaf tasks. SUmb is too complicated to be hand-tuned.
60 Conclusion
61 The Road Ahead: 20× improvement in GPU performance by 2015, but <2× improvement in CPU performance by 2015; most of the value delivered to the end user comes from the GPU, the throughput computer. But not all things scale at the same rate: memory bandwidth scales by <5×, energy per bit-mm on chip is nearly constant, and energy per op improves by only 5×. Each of these presents an interesting challenge.
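The mismatch on this slide can be made concrete with a ratio or two: if compute grows faster than memory bandwidth, the FLOPs delivered per byte of memory traffic must grow by their ratio; and at constant power, any performance growth beyond the energy-per-op improvement must come from architecture and locality. The growth factors below are the slide's numbers.

```python
def required_intensity_growth(compute_growth=20.0, bandwidth_growth=5.0):
    # factor by which arithmetic intensity (FLOPs per byte of memory
    # traffic) must rise for applications to track compute scaling
    return compute_growth / bandwidth_growth

def architecture_gap(perf_growth=20.0, energy_per_op_gain=5.0):
    # at constant power, the factor architecture and locality must supply
    return perf_growth / energy_per_op_gain
```

Both ratios come out to 4×, which is why the talk treats locality, not raw arithmetic, as the scaling problem for the coming decade.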
62 Conclusion: parallelism and locality are the keys to efficient computation, and denial architecture is at an end. We are entering an era of Throughput Computing on heterogeneous computers: value comes from throughput applications running on throughput processors (like a GPU). Performance = parallelism, efficiency = locality, and applications have lots of both. Stream processing: many ALUs exploit parallelism; a rich, exposed storage hierarchy enables locality; simple control and synchronization reduce overhead. Stream programming: explicit movement and bulk operations enable strategic optimization and expose parallelism and locality. The result: performance and efficiency — TOPs on a chip at 20-30× the efficiency of conventional processors — and performance portability. GPUs are the ultimate stream processors.
63 Backup
64 Application Performance (cont.). [Chart: execution-time breakdown, 0-100%, for DEPTH, MPEG, QRD, RTSL, and their average, split into host bandwidth stalls, stream controller overhead, memory stalls, cluster stalls, kernel non-main-loop time, kernel main-loop overhead, and operations.]
More informationThe Implementation and Analysis of Important Symmetric Ciphers on Stream Processor
2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore The Implementation and Analysis of Important Symmetric Ciphers on Stream Processor
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationEE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems
EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationExploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationAnalysis and Performance Results of a Molecular Modeling Application on Merrimac
Analysis and Performance Results of a Molecular Modeling Application on Merrimac Mattan Erez Jung Ho Ahn Ankit Garg William J. Dally Eric Darve Department of Electrical Engineering Department of Mechanical
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationVector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks
Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis Stanford University David Patterson U.C. Berkeley http://csl.stanford.edu/~christos Motivation Ideal processor
More informationGRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
More informationEE282 Computer Architecture. Lecture 1: What is Computer Architecture?
EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer
More informationLocality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationSOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS
SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power
More informationPerformance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationCPU Pipelining Issues
CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput
More informationComputer Architecture s Changing Definition
Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationEECS4201 Computer Architecture
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More information