Stream Programming: Explicit Parallelism and Locality. Bill Dally Edge Workshop May 24, 2006
|
|
- Claud York
- 5 years ago
- Views:
Transcription
1 Stream Programming: Explicit Parallelism and Locality Bill Dally Edge Workshop May 24, 2006 Edge: 1 May 24, 2006
2 Outline Technology Constraints Architecture Stream programming Imagine and Merrimac Other stream processors Future directions Edge: 2 May 24, 2006
3 ILP is mined out end of superscalar processors Time for a new architecture 1e+7 1e+6 1e+5 1e+4 1e+3 52%/year Perf (ps/inst) Linear (ps/inst) 1e+2 1e+1 1e+0 74%/year 30:1 19%/year 1,000:1 1e-1 1e-2 1e-3 30,000:1 1e Dally et al. The Last Classsical Computer, ISAT Study, 2001 Edge: 3 May 24, 2006
4 Performance = Parallelism Efficiency = Locality Edge: 4 May 24, 2006
5 Arithmetic is cheap, Communication is expensive Arithmetic Can put 100s of FPUs on a chip $0.50/GFLOPS, 50mW/GFLOPS Exploit with parallelism Communication Dominates cost $8/GW/s 2W/GW/s (off-chip) BW decreases (and cost increases) with distance Power increases with distance Latency increases with distance But can be hidden with parallelism Need locality to conserve global bandwidth 64-bit FPU (to scale) 0.5mm Decreasing BW Increasing power 1 clock 12mm 90n m Chip $200 1GHz Edge: 5 May 24, 2006
6 Cost of data access varies by 1000x From Energy Cost* Time Local Register 10pJ $0.50 1ns Chip Region (2mm) 50pJ $2 4ns Global on Chip (15mm) 200pJ $10 20ns Off chip (node mem) 1nJ $50 200ns Global 5nJ $500 1us *Cost of providing 1GW/s of bandwidth All numbers approximate Edge: 6 May 24, 2006
7 So we should build chips that look like this Edge: 7 May 24, 2006
8 An abstract view Global Memory Switch LM CM Switch RM Switch R R R RM Switch R R R RM Switch R R R A A A A A A A A A Edge: 8 May 24, 2006
9 Real question is: How to orchestrate movement of data Edge: 9 May 24, 2006
10 Conventional Wisdom: Use caches Global Memory Switch LM CM Switch RM Switch R R R RM Switch R R R RM Switch R R R A A A A A A A A A Edge: 10 May 24, 2006
11 Caches squander bandwidth our scarce resource Unnecessary data movement Poorly scheduled data movement Idles expensive resources waiting on data More efficient to map programs to an explicit memory hierarchy Edge: 11 May 24, 2006
12 Example Simplified Finite-Element Code loop over cells flux[i] =... loop over cells... = f(flux[i],...) Edge: 12 May 24, 2006
13 Explicitly block into SRF loop over cells flux[i] =... loop over cells... = f(flux[i],...) Flux passed through SRF, no memory traffic Edge: 13 May 24, 2006
14 Explicitly block into SRF loop over cells flux[i] =... loop over cells... = f(flux[i],...) Explicit re-use of Cells, no misses Edge: 14 May 24, 2006
15 Stream loads/stores (bulk operations) hide latency (1000s of words in flight) Cells gather Cells fn1 Flux fn2 scatter Cells DRAM Cells SRFs LRFs Edge: 15 May 24, 2006
16 Explicit storage enables simple, efficient execution All needed data and instructions on-chip no misses Edge: 16 May 24, 2006
17 Caches lack predictability (controlled via a wet noodle ) Edge: 17 May 24, 2006
18 Caches are controlled via a wet noodle 99% hit rate, 1 miss costs 100s of cycles, 10,000s of ops Edge: 18 May 24, 2006
19 So how do we program an explicit hierarchy? Edge: 19 May 24, 2006
20 Stream Programming: Parallelism, Locality, and Predictability Parallelism Data parallelism across stream elements Task parallelsm across kernels ILP within kernels Locality Producer/consumer Within kernels Predictability Enables scheduling K1 K2 K3 K4 Edge: 20 May 24, 2006
21 Evolution of Stream Programming 1997 StreamC/KernelC Break programs into kernels Kernels operate only on input/output streams and locals Communication scheduling and stream scheduling 2001 Brook Continues the construct of streams and kernels Hides underlying details Too one-dimensional 2005 Sequoia Generalizes kernels to tasks Tasks operate on local data Local data gathered in an arbitrary way Inner tasks subdivide, leaf tasks compute Machine-specific details factored out Edge: 21 May 24, 2006
22 StreamC/KernelC Image 0 convolve convolve SAD Depth Map Image 1 convolve convolve STREAMPROG depth) { im_stream<pixels> in, tmp; for (i=0; i<rows; i++) { convolve(in, tmp, ); convolve(tmp, conv_row, ); } for (i=0; i<rows; i++) { SAD(conv_row, depth_row, ); } } KERNEL convolve( istream<int> a, ostream<int> y) { loop_stream(a) { int ai, out; a >> ai; out = dotproduct(ai, ); y << out; } } Edge: 22 May 24, 2006
23 Explicit storage enables simple, efficient execution unit scheduling One iteration SW Pipeline ComputeCellInt kernel from StreamFem3D Over 95% of peak with simple hardware Depends on explicit communication to make delays predictable Edge: May 24, 2006
24 Stream scheduling exploits explicit storage to reduce bandwidth demand StreamFEM application Read-Only Table Lookup Data (Master Element) Compute Flux States Compute Numerical Flux Gather Cell Compute Cell Interior Advance Cell Element Faces Gathered Elements Face Geometry Numerical Flux Cell Orientations Cell Geometry Elements (Current) Elements (New) Prefetching, reuse, use/def, limited spilling Edge: 24 May 24, 2006
25 Sequoia Generalize Kernels into Leaf Tasks Perform actual computation Analogous to kernels Small working set Node memory Aggregate LS void task matmul::leaf( in float A[M][P], in float B[P][N], inout float C[M][N] ) { for (int i=0; i<m; i++) { for (int j=0; j<n; j++) { for (int k=0; k<p; k++) { C[i][j] += A[i][k] * B[k][j]; } LS 0 matmul leaf FU LS 7 FU Edge: 25 May 24, 2006
26 Inner tasks Decompose to smaller subtasks Recursively Larger working sets void task matmul::inner( in float A[M][P], in float B[P][N], inout float C[M][N] ) { tunable unsigned int U, X, V; blkset Ablks = rchop(a, U, X); blkset Bblks = rchop(b, X, V); blkset Cblks = rchop(c, U, V); mappar (int i=0 to M/U, int j=0 to N/V) mapreduce (int k=0 to P/X) Node memory Aggregate LS matmul inner LS 0 matmul leaf FU LS 7 matmul leaf FU } matmul(ablks[i][k],bblks[k][j],cblks[i][j]); Edge: 26 May 24, 2006
27 Stream Processors make communication explicit Enables optimization Edge: 27 May 24, 2006
28 Stream architecture makes communication explicit exploits parallelism and locality Chip Crossing(s) 10kχ switch 1kχ switch 100χ wire LRF DRAM Bank Cache Bank SRF Lane CL SW LRF Chip Pins and Router M SW LRF DRAM Bank Cache Bank ALU and cluster arrays shown 1D LRF here may be laid out as 2D arrays Edge: 28 May 24, 2006 SRF Lane CL SW
29 Imagine VLSI Implementation Chip Details 2.56cm 2 die, 0.15um process, 21M transistors, 792-pin BGA Collaboration with TI ASIC Chips arrived on April 1, 2002 Dual-Imagine test board Edge: 29 May 24, 2006
30 Application Performance (cont.) Execution time 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% DEPTH MPEG QRD RTSL Average host bandwidth stalls stream controller overhead memory stalls cluster stalls kernel non main loop kernel main loop overhead operations Edge: 30 May 24, 2006
31 Applications match the bandwidth hierarchy 1000 Bandwidth (GB/s) LRF SRF DRAM 0.1 Peak DEPTH MPEG QRD RTSL Edge: 31 May 24, 2006
32 Merrimac Streaming Supercomputer Backplane Board Node 16 x XDR-DRAM 2GBytes 64GBytes/s St ream Processor 64 FPU 128 GFLOPS Node 2 Node 16 Board 2 16 Nodes 1K FPUs 2TFLOPS 32GBytes Board 32 Backplane 2 32 Boards 512 Nodes 32K FPUs 64TFLOPS 1TBytes Backplane 32 12GBytes/s pairs On-Board Network 48GBytes/s pairs 6 Teradyne GbX E/O O/E Intra-Cabinet Network 768GBytes/s 2K+2K links Ribbon Fiber Inter-Cabinet Network Bisection 24TBytes/s Scalable from 2-TFLOP workstation to 2-PFLOP supercomputer Edge: 32 May 24, 2006
33 Merrimac Application Results Application Sustained GFLOPS FP Ops / Mem Ref LRF Refs SRF Refs Mem Refs StreamFEM3D (Euler, quadratic) M (95.0%) 6.3M (3.9%) 1.8M (1.1%) StreamFEM3D (MHD, constant) M (99.4%) 7.7M (0.4%) 2.8M (0.2%) StreamMD (grid algorithm) 14.2 * 12.1 * 90.2M (97.5%) 1.6M (1.7%) 0.7M (0.8%) GROMACS 38.8 * 9.7 * 108M (95.0%) 4.2M (2.9%) 1.5M (1.3%) StreamFLO 12.9 * 7.4 * 234.3M (95.7%) 7.2M (2.9%) 3.4M (1.4%) Simulated on a machine with 64GFLOPS peak performance and no fused MADD * The low numbers are a result of many divide and square-root operations Applications achieve high performance and make good use of the bandwidth hierarchy Edge: 33 May 24, 2006
34 Other Stream Processors Edge: 34 May 24, 2006
35 Other Stream Processors GPUs GFLOP/s 10-30W ClearSpeed CSX GFLOP/s 10W STI Cell ~200 GFLOP/s 100W Edge: 35 May 24, 2006
36 Other Stream Processors Technology pushing many to build stream processors GPUs (Nvidia, ATI), Game Processors (Cell), Physics Processors (Ageia), Accelerators (Clearspeed) Many (10s-100s) of FPUs Distributed local storage Latency hiding on access to external memory Block access or deeply multithreaded All benefit from stream programming But the right architecture makes it easier and more efficient Edge: 36 May 24, 2006
37 Architecture Issues On-chip memory Read and write access to on-chip storage Producer-consumer locality demands write access Data movement between on-chip memories Without going off chip Off-chip memory No substitute for bandwidth Efficient gather and scatter required Edge: 37 May 24, 2006
38 What s Next? Edge: 38 May 24, 2006
39 ILP is mined out end of superscalar processors Time for a new architecture 1e+7 1e+6 1e+5 1e+4 1e+3 52%/year Perf (ps/inst) Linear (ps/inst) 1e+2 1e+1 1e+0 74%/year 30:1 19%/year 1,000:1 1e-1 1e-2 1e-3 30,000:1 1e Dally et al. The Last Classsical Computer, ISAT Study, 2001 Edge: 39 May 24, 2006
40 Computing landscape is changing Many function units Deep, distributed storage hierarchy Communication limited Research is needed to understand how to architect and program these processors Not an incremental fix: Fundamental rethinking of basic architecture, programming model, and compilers is required Edge: 40 May 24, 2006
41 Software Topics Exposed Communication Programming Models Abstract storage hierarchy and communication costs Portable codes with predictable performance Compiling bulk operations Strategic (vs tactical) program reorganization Scheduling bulk data transfers Size and shape of blocking Irregular computations Localization of shared neighbors Variable size results Applications Communication-efficient algorithms Edge: 41 May 24, 2006
42 Hardware Topics Close efficiency gap with hard-wired engines Gap is x today (10 for stream processors) Efficient data movement is first step Other overheads remain to be removed Storage hierarchies that can be abstracted Balancing parallelism ILP x DLP x TLP On-chip networks To connect within and between levels of the hierarchy Communication and synchronization mechanisms Drives granularity which in turn determines available parallelism Mechanisms for reuse of irregular data Edge: 42 May 24, 2006
43 Summary Communication is expensive, arithmetic is cheap Parallelism to exploit arithmetic Locality to conserve bandwidth Architectures evolving toward a deep, broad storage hierarchy Storage to hide latency, cover bandwidth taper Stream processors >10x efficiency of conventional CPUs Explicitly manage this hierarchy Makes efficient use of scarce, expensive resources Enables optimization Generalized Stream programming Bulk operations: data movement and kernels Parallelism, Locality, and Predictability Edge: 43 May 24, 2006
The End of Denial Architecture and The Rise of Throughput Computing
The End of Denial Architecture and The Rise of Throughput Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University May 19, 2009 Async Conference
More informationThe End of Denial Architecture and The Rise of Throughput Computing
The End of Denial Architecture and The Rise of Throughput Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University Outline! Performance = Parallelism
More informationStream Processor Architecture. William J. Dally Stanford University August 22, 2003 Streaming Workshop
Stream Processor Architecture William J. Dally Stanford University August 22, 2003 Streaming Workshop Stream Arch: 1 August 22, 2003 Some Definitions A Stream Program expresses a computation as streams
More informationEE482C, L1, Apr 4, 2002 Copyright (C) by William J. Dally, All Rights Reserved. Today s Class Meeting. EE482S Lecture 1 Stream Processor Architecture
1 Today s Class Meeting EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J Dally Computer Systems Laboratory Stanford University billd@cslstanfordedu What is EE482C? Material covered
More informationEE482S Lecture 1 Stream Processor Architecture
EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu 1 Today s Class Meeting What is EE482C? Material covered
More informationStream Processing: a New HW/SW Contract for High-Performance Efficient Computation
Stream Processing: a New HW/SW Contract for High-Performance Efficient Computation Mattan Erez The University of Texas at Austin CScADS Autotuning Workshop July 11, 2007 Snowbird, Utah Stream Processors
More informationSequoia. Mattan Erez. The University of Texas at Austin
Sequoia Mattan Erez The University of Texas at Austin EE382N: Parallelism and Locality, Fall 2015 1 2 Emerging Themes Writing high-performance code amounts to Intelligently structuring algorithms [compiler
More informationHalfway! Sequoia. A Point of View. Sequoia. First half of the course is over. Now start the second half. CS315B Lecture 9
Halfway! Sequoia CS315B Lecture 9 First half of the course is over Overview/Philosophy of Regent Now start the second half Lectures on other programming models Comparing/contrasting with Regent Start with
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More informationIMAGINE: Signal and Image Processing Using Streams
IMAGINE: Signal and Image Processing Using Streams Brucek Khailany William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles Concurrent VLSI Architecture
More informationThe Future of GPU Computing
The Future of GPU Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University November 18, 2009 The Future of Computing Bill Dally Chief Scientist
More informationData Parallel Architectures
EE392C: Advanced Topics in Computer Architecture Lecture #2 Chip Multiprocessors and Polymorphic Processors Thursday, April 3 rd, 2003 Data Parallel Architectures Lecture #2: Thursday, April 3 rd, 2003
More informationEfficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford
Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable
More informationA Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis
A Data-Parallel Genealogy: The GPU Family Tree John Owens University of California, Davis Outline Moore s Law brings opportunity Gains in performance and capabilities. What has 20+ years of development
More informationSequoia. Mike Houston Stanford University July 9 th, CScADS Workshop
Sequoia Mike Houston Stanford University July 9 th, 2007 - CScADS Workshop Today s outline Sequoia programming model Sequoia targets Tuning in Sequoia http://sequoia.stanford.edu - Supercomputing 2006
More informationMEMORY HIERARCHY DESIGN FOR STREAM COMPUTING
MEMORY HIERARCHY DESIGN FOR STREAM COMPUTING A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationStream Processing for High-Performance Embedded Systems
Stream Processing for High-Performance Embedded Systems William J. Dally Computer Systems Laboratory Stanford University HPEC September 25, 2002 Stream Proc: 1 Sept 25, 2002 Report Documentation Page Form
More informationMerrimac: Supercomputing with Streams
Merrimac: Supercomputing with Streams William J. Dally Patrick Hanrahan Mattan Erez Timothy J. Knight François Labonté Jung-Ho Ahn Nuwan Jayasena Ujval J. Kapasi Abhishek Das Jayanth Gummaraju Ian Buck
More informationEE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems
EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationAnalysis and Performance Results of a Molecular Modeling Application on Merrimac
Analysis and Performance Results of a Molecular Modeling Application on Merrimac Mattan Erez Jung Ho Ahn Ankit Garg William J. Dally Eric Darve Department of Electrical Engineering Department of Mechanical
More informationEvaluating the Imagine Stream Architecture
Evaluating the Imagine Stream Architecture Jung Ho Ahn, William J. Dally, Brucek Khailany, Ujval J. Kapasi, and Abhishek Das Computer Systems Laboratory Stanford University, Stanford, CA 94305, USA {gajh,billd,khailany,ujk,abhishek}@cva.stanford.edu
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationLecture 16 Data Level Parallelism (3) EEC 171 Parallel Architectures John Owens UC Davis
Lecture 16 Data Level Parallelism (3) EEC 171 Parallel Architectures John Owens UC Davis Credits John Owens / UC Davis 2007 9. Thanks to many sources for slide material: Computer Organization and Design
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationProject Proposals. 1 Project 1: On-chip Support for ILP, DLP, and TLP in an Imagine-like Stream Processor
EE482C: Advanced Computer Organization Lecture #12 Stream Processor Architecture Stanford University Tuesday, 14 May 2002 Project Proposals Lecture #12: Tuesday, 14 May 2002 Lecturer: Students of the class
More informationGPU Computation Strategies & Tricks. Ian Buck NVIDIA
GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationTHE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research
THE PATH TO EXASCALE COMPUTING Bill Dally Chief Scientist and Senior Vice President of Research The Goal: Sustained ExaFLOPs on problems of interest 2 Exascale Challenges Energy efficiency Programmability
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationA Data-Parallel Genealogy: The GPU Family Tree
A Data-Parallel Genealogy: The GPU Family Tree Department of Electrical and Computer Engineering Institute for Data Analysis and Visualization University of California, Davis Outline Moore s Law brings
More informationFrom Brook to CUDA. GPU Technology Conference
From Brook to CUDA GPU Technology Conference A 50 Second Tutorial on GPU Programming by Ian Buck Adding two vectors in C is pretty easy for (i=0; i
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationEECS4201 Computer Architecture
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationOptimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA
Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei H. Hwu
More informationLecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationAdvance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts
Computer Architectures Advance CPU Design Tien-Fu Chen National Chung Cheng Univ. Adv CPU-0 MMX technology! Basic concepts " small native data types " compute-intensive operations " a lot of inherent parallelism
More informationMEMORY AND CONTROL ORGANIZATIONS OF STREAM PROCESSORS
MEMORY AND CONTROL ORGANIZATIONS OF STREAM PROCESSORS A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationAddressing the Memory Wall
Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationIntroduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationIMPLICIT+EXPLICIT Architecture
IMPLICIT+EXPLICIT Architecture Fortran Carte Programming Environment C Implicitly Controlled Device Dense logic device Typically fixed logic µp, DSP, ASIC, etc. Implicit Device Explicit Device Explicitly
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationWhat s New with GPGPU?
What s New with GPGPU? John Owens Assistant Professor, Electrical and Computer Engineering Institute for Data Analysis and Visualization University of California, Davis Microprocessor Scaling is Slowing
More informationStreaming Supercomputer
Chapter 5 Streaming Supercomputer The goal of the streaming supercomputer project is to develop a high-performance computer system, hardware and programming system, that achieves two orders of magnitude
More informationFault Tolerance Techniques for the Merrimac Streaming Supercomputer
Fault Tolerance Techniques for the Merrimac Streaming Supercomputer Mattan Erez Nuwan Jayasena Timothy J. Knight William J. Dally Department of Electrical Engineering Stanford University Stanford, CA 94305
More informationHPC VT Machine-dependent Optimization
HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationProcessor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP
Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor
More informationWhy GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)
Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009
More informationGeForce4. John Montrym Henry Moreton
GeForce4 John Montrym Henry Moreton 1 Architectural Drivers Programmability Parallelism Memory bandwidth 2 Recent History: GeForce 1&2 First integrated geometry engine & 4 pixels/clk Fixed-function transform,
More informationImplicit and Explicit Optimizations for Stencil Computations
Implicit and Explicit Optimizations for Stencil Computations By Shoaib Kamil 1,2, Kaushik Datta 1, Samuel Williams 1,2, Leonid Oliker 2, John Shalf 2 and Katherine A. Yelick 1,2 1 BeBOP Project, U.C. Berkeley
More informationSteve Scott, Tesla CTO SC 11 November 15, 2011
Steve Scott, Tesla CTO SC 11 November 15, 2011 What goal do these products have in common? Performance / W Exaflop Expectations First Exaflop Computer K Computer ~10 MW CM5 ~200 KW Not constant size, cost
More informationFundamentals of Computers Design
Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2
More information18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013
18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,
More informationEvolution of Computers & Microprocessors. Dr. Cahit Karakuş
Evolution of Computers & Microprocessors Dr. Cahit Karakuş Evolution of Computers First generation (1939-1954) - vacuum tube IBM 650, 1954 Evolution of Computers Second generation (1954-1959) - transistor
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationExploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer
More informationUC Berkeley CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C : Machine Structures Lecture 39 Intra-machine Parallelism 2010-04-30!!!Head TA Scott Beamer!!!www.cs.berkeley.edu/~sbeamer Old-Fashioned Mud-Slinging with
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationUnderstanding Sources of Inefficiency in General-Purpose Chips
Understanding Sources of Inefficiency in General-Purpose Chips Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex Solomatnikov Benjamin Lee Stephen Richardson Christos Kozyrakis Mark Horowitz GP Processors
More informationLecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"
Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationSMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]
More informationFundamentals of Computer Design
Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationGPGPU. Peter Laurens 1st-year PhD Student, NSC
GPGPU Peter Laurens 1st-year PhD Student, NSC Presentation Overview 1. What is it? 2. What can it do for me? 3. How can I get it to do that? 4. What s the catch? 5. What s the future? What is it? Introducing
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationGraphics Processor Acceleration and YOU
Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have
More informationPersistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL
(stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More information