The End of Denial Architecture and The Rise of Throughput Computing


1 The End of Denial Architecture and The Rise of Throughput Computing. Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA; Bell Professor of Engineering, Stanford University. May 19, 2009, Async Conference, UNC.

2 Outline. Performance = parallelism; efficiency = locality. Single-thread processors are in denial about these two facts. We are entering the era of Throughput Computing. Stream processors: parallel arithmetic units and an exposed storage hierarchy. Stream programming: bulk operations. GPU computing.

3 Moore's Law. In 1965 Gordon Moore predicted the number of transistors on an integrated circuit would double every year (later revised to every 18 months). He also predicted L^3 power scaling for constant function. He made no prediction of processor performance: advances in architecture turn device scaling into performance scaling. Moore, Electronics 38(8), April 19, 1965.

4 The Computer Value Chain. Moore's law gives us more transistors; architects turn these transistors into more performance; applications turn this performance into value for the user.

5 Discontinuity 1: The End of ILP Scaling. [Log-scale plot of performance (ps/inst) by year; single-thread performance improved at 19%/year.] Dally et al., The Last Classical Computer, ISAT Study, 2001.

6 Future potential of novel architecture is large (1,000x vs. 30x). [Same log-scale performance plot, annotated with 30:1, 1,000:1, and 30,000:1 ratios against the 19%/year trend line.] Dally et al., The Last Classical Computer, ISAT Study, 2001.

7 Single-Thread Processor Performance vs Calendar Year Source: Hennessy & Patterson, CAAQA, 4th Edition

8 Technology Constraints

9 The CMOS Chip is our Canvas. [A 20mm x 20mm die.]

10 4,000 64b FPUs fit on a chip. [20mm die; each 64b FPU: 0.1mm^2, 50pJ/op, 1.5GHz.]

11 200,000 16b MACs fit on a chip. [20mm die; 64b FPU: 0.1mm^2, 50pJ/op, 1.5GHz; 16b MAC: 0.002mm^2, 1pJ/op, 1.5GHz.]

12 Moving a word across the die costs the energy of 124 MACs (or 10 FMAs); moving a word off chip costs 250 MACs (or 20 FMAs). [Energy figures: 64b 1mm channel, 25pJ/word (10mm: 250pJ, 4 cycles); 16b 1mm channel, 6pJ/word (10mm: 62pJ, 4 cycles); 64b off-chip channel, 1nJ/word; 16b off-chip channel, 250pJ/word; 64b floating point: 50pJ/op; 16b fixed point: 1pJ/op.]

13 Discontinuity 2: The End of L^3 Power Scaling. Gordon Moore, ISSCC 2003.

14 Performance = Parallelism Efficiency = Locality

15 The End of Denial Architecture. Single-thread processors are in denial about parallelism and locality. They provide two illusions: serial execution, which denies parallelism and tries to exploit it with ILP (limited scalability); and flat memory, which denies locality and tries to preserve the illusion with caches (very inefficient when the working set doesn't fit in the cache).

16 We are entering the era of Throughput Computing

17 Throughput computing is what matters now. Latency-optimized processors (most CPUs) are improving very slowly; little value is being delivered by their evolution. Throughput-optimized processors (like GPUs) are still improving at >70% per year, and this drives new throughput applications that convert the performance into value. Going forward, throughput processors matter, not latency processors.

18 Applications

19 Scientific Applications. Large data sets with lots of parallelism. Increasingly irregular (AMR): irregular and dynamic data structures require efficient gather/scatter. Increasingly complex models with lots of locality, though the global solution is sometimes bandwidth limited, with less locality in those phases.

20 Embedded Applications. Codecs, modems, image processing, etc. Lots of data (pixels, samples, etc.) and lots of parallelism. Increasingly complex, with lots of data-dependent control. High arithmetic intensity and lots of locality.

21 Performance = Parallelism. Efficiency = Locality. Fortunately, most applications have lots of both. Amdahl's law doesn't apply to most future applications.

22 Stream Processor Architecture

23 Organize computation to optimize the use of scarce bandwidth: minimize expensive data movement and keep scarce bandwidth resources busy. Take advantage of plentiful arithmetic: operate in parallel when data is local. Avoid denial architecture: don't hide parallelism or locality; a flat, serial model inhibits optimization.

24 Optimize use of scarce bandwidth: provide a rich, exposed storage hierarchy. [Diagram: global memory connected through a switch to chip memory (CM) and local memory (LM), then through switches to register files (RM/R), which feed the arithmetic units (A).]

25 Optimize use of scarce bandwidth: provide a rich, exposed storage hierarchy and explicitly manage data movement on it; this reduces demand and increases utilization. [StreamFEM dataflow: gather cells, compute flux states, compute numerical flux, compute cell interior, advance cells; streams of element faces, face geometry, numerical flux, cell geometry, elements (current/new), gathered elements, and cell orientations; read-only table-lookup data (master element).]

26 Optimize use of scarce bandwidth: provide a rich, exposed storage hierarchy and explicitly manage data movement. This makes execution predictable and enables more arithmetic per unit of bandwidth.

27 Summary so far. Arithmetic is cheap, bandwidth is expensive. Performance = parallelism; efficiency = locality; and most applications have lots of both. Optimize use of scarce bandwidth with a rich, exposed storage hierarchy and explicit bulk transfers: this reduces demand, increases utilization, and enables more arithmetic and predictable execution.

28 What is a Stream Processor? Lots of arithmetic units, to exploit parallelism. A rich, exposed storage hierarchy, which makes frequent communication inexpensive and kernel execution predictable. Explicit bulk transfers (streams) and bulk operations (kernels): a load/store machine for streams. Simple control (SIMD x VLIW x threads), simple synchronization, low overhead. [Diagram: lanes 0-15, each with operand register files (ORFs) feeding ALUs and a COMM unit, a 16KB lane register file, and local switches; an inter-lane switch and a stream load/store unit connect to external DRAM.]

29 Some Example Stream Processors

30 Some example stream processors: Imagine, Cell, Merrimac, NVIDIA GT200, and Storm-1. [Annotated die photos: the Imagine die (10.2mm) shows 16 RDRAM interfaces with forward ECC, cache banks, the clusters, a microcontroller, and the network; the Merrimac floorplan (12.5mm) shows clusters, MIPS64 20Kc cores, a memory switch, address generators, and reorder buffers.]

31 Imagine: 48 32b FPUs, in 8 lanes of 6 FPUs each. Exposed storage hierarchy: LRFs, SRF, DRAM. Bulk operations, including gather/scatter and conditional streams. Applications: graphics (OpenGL, RTSL, Reyes, ray tracing), image processing, codecs, modems, network processing. 20x the performance/W of efficient DSPs in the same technology; parity with GPUs in the same technology.

32 Imagine Architecture. [Block diagram: a host processor drives the Imagine processor's stream controller and microcontroller; SDRAM memory systems feed SRF banks, each paired with an ALU cluster; a network interface connects to the Imagine network.]

33 Imagine Architecture (cont.). [Same block diagram, with one ALU cluster expanded: three adders, two multipliers, a divide/square-root unit (DSQ), a scratchpad (SP), and a COMM unit to/from the other clusters, all connected by an intra-cluster switch to/from the SRF.]

34 Producer-Consumer Locality in the Depth Extractor. [Dataflow across memory/global data, SRF/streams, and clusters/kernels: rows of pixels flow through Gaussian and Laplacian convolutions and a SAD kernel, with partial sums, blurred and sharpened rows, and filtered row segments staged in the SRF, producing depth map row segments. Reference ratio memory : SRF : clusters = 1 : 23 : 317.]

35 Applications match the bandwidth hierarchy. [Log-scale bar chart (0.1 to 1000 GB/s): LRF, SRF, and DRAM bandwidth used by DEPTH, MPEG, QRD, and RTSL, against the peak of each level.]

36 Merrimac: 64 64b FP MADD units, in 16 clusters of 4 units each, with a stream cache, shared memory, and scatter-add via the network. [Bandwidth hierarchy: 16GB/s at the chip pins and router, 64GB/s at the memory switch (DRAM and cache banks), 512GB/s into the SRF lanes, and 3,840GB/s from the LRFs; energy scale from a 100-unit 1mm wire up through 1k- and 10k-unit switches for chip crossings.]

37 Merrimac Application Results. [Table: application, how developed, how compiled, sustained GFLOP/s, and speedup/efficiency vs. a Pentium 4. StreamFEM (Brook, StreamC): 69 GFLOP/s. StreamMD (C, StreamC). StreamFLO (Brook, metacompiled): 13 GFLOP/s. StreamCDP (Fortran, StreamC): 8 GFLOP/s. DGEMM (StreamC). CONV2D (StreamC): 79 GFLOP/s. FFT3D (StreamC). StreamSPAS (C, StreamC).]

38 Merrimac Bandwidth Hierarchy. [Table: per application, sustained GFLOPS, FP ops per memory reference, and reference counts at each level. LRF references dominate: e.g. 95.0-99.4% for StreamFEM3D (Euler quadratic, MHD constant), 95.0% for GROMACS/StreamMD at 50.1 GFLOPS, 95.7% for StreamFLO at 12.9 GFLOPS, with SRF and memory references each a few percent or less. StreamSPAS sustains 6.2 GFLOPS at 80% of memory bandwidth; dense matrix multiply sustains 116 GFLOPS (90% of peak FLOPS). GROMACS and StreamFLO figures are dominated by divide and square-root operations.]

39 A CUDA-Enabled GPU is The Ultimate Throughput Computer. GeForce GTX 280 / Tesla T10: 240 scalar processors (30 SMs x 8 each), >1 TFLOPS peak, exposed hierarchical memory, and many times the performance and efficiency of a latency-optimized processor.

40 GPU Throughput Computer. [Chip overview: communication fabric, memory & I/O, fixed-function acceleration.] GeForce GTX 280 / Tesla T10.

41 Stream Programming

42 Stream Programming: Parallelism, Locality, and Predictability. Parallelism: data parallelism across stream elements, task parallelism across kernels, and ILP within kernels. Locality: producer/consumer, and within kernels. Predictability: enables scheduling. [Kernel graph: K1, K2, K3, K4.]

43 Evolution of Stream Programming. 1997, StreamC/KernelC: break programs into kernels and streams; kernels operate only on input/output streams and locals; communication scheduling and stream scheduling. 2002, Brook: continues the constructs of streams and kernels; hides the underlying details; too one-dimensional. 2007, Sequoia: generalizes kernels to tasks; tasks operate on local data, gathered in an arbitrary way; inner tasks subdivide, leaf tasks compute; machine-specific details are factored out.

44 Stream Programming (the Sequoia view). Express computation as a tree of tasks: inner tasks (streams) move data to subdivide the problem; leaf tasks (kernels) solve a local instance of the problem. This makes leaf tasks completely predictable, enabling static optimization, and enables strategic optimization: stream scheduling and tuning of data movement. [Task tree: matmul_inner at the node memory and aggregate LS levels; matmul_leaf on LS 0 through LS 7, each feeding an FU.]

45 Sequoia: Generalize Kernels into Leaf Tasks. Leaf tasks perform the actual computation (analogous to kernels) on a small working set:

void task matmul::leaf( in float A[M][P], in float B[P][N], inout float C[M][N] )
{
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < P; k++)
        C[i][j] += A[i][k] * B[k][j];
}

[Task tree: node memory and aggregate LS above matmul_leaf tasks running on LS 0 through LS 7 / FUs.]

46 Inner tasks decompose the problem into smaller subtasks, recursively, with larger working sets:

void task matmul::inner( in float A[M][P], in float B[P][N], inout float C[M][N] )
{
  tunable unsigned int U, X, V;
  blkset Ablks = rchop(A, U, X);
  blkset Bblks = rchop(B, X, V);
  blkset Cblks = rchop(C, U, V);
  mappar (int i = 0 to M/U, int j = 0 to N/V)
    mapreduce (int k = 0 to P/X)
      matmul(Ablks[i][k], Bblks[k][j], Cblks[i][j]);
}

[Task tree: matmul_inner at the node memory / aggregate LS levels invoking matmul_leaf on LS 0 through LS 7.]

47 CUDA as a Stream Language. Explicit control of the memory hierarchy with shared memory (__shared__ float a[SIZE];), which also enables communication between the threads of a CTA. Transfer data up and down the hierarchy, and operate on the data with kernels. But CUDA does allow access to arbitrary data within a kernel.

48 Examples. [Application speedups on the GPU: 146x, 36x, 19x, 17x, 100x, 149x, 47x, 20x, 24x, 30x.]

49 Stream loads/stores (bulk operations) hide latency, with 1000s of words in flight. [Pipeline diagram: stream loads (LD Cells 0-3) from DRAM run ahead of the kernels (Flux 0-2, Update 0-1) executing out of the SRFs and LRFs, overlapped with gathers, scatters, and stores (ST Cells 0-1).]

50 Explicit storage enables simple, efficient execution: all needed data and instructions are on-chip, so there are no misses. [Same pipeline diagram: loads, kernels, and stores proceed back-to-back.]

51 Caches lack predictability (they are controlled via a "wet noodle"). [Pipeline comparison: with explicit storage the schedule runs as planned; with a cache, misses split Flux 0 and Update 0 into stalled fragments (Flux 0A/0B, Update 0A/0B) and delay the whole schedule.]

52 Caches are controlled via a wet noodle: even at a 99% hit rate, one miss costs 100s of cycles, i.e. 10,000s of operations. [Same pipeline comparison as the previous slide.]

53 Predictability enables efficient static scheduling. [Software-pipelined schedule of one iteration of the ComputeCellInt kernel from StreamFEM3D.] Over 95% of peak with simple hardware; this depends on explicit communication to make delays predictable.

54 Bulk operations enable strategic optimization: the StreamFEM application. [Dataflow: gather cells, compute flux states, compute numerical flux, compute cell interior, advance cells, with read-only table-lookup data (master element) and streams of element faces, face geometry, numerical flux, cell geometry, elements (current/new), gathered elements, and cell orientations.]

55 Results: horizontal portability of the same programs across scalar, SMP, disk, cluster, Cell, and PS3 targets: SAXPY, SGEMV, SGEMM, CONV2D, FFT3D*, GRAVITY, HMMER*. (* Dataset size reduced to fit in memory.)

56 System Utilization. [Bar chart: percentage of runtime for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on the SMP, disk, cluster, Cell, and PS3 targets.]

57 Resource Utilization on IBM Cell. [Bar chart: bandwidth utilization vs. compute utilization (%) per benchmark.]

58 Search Spaces. T. Kisuki, P. M. W. Knijnenburg, and Michael F. P. O'Boyle, Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation. In IEEE PACT, 2000.

59 Performance of the Auto-tuner. [Table: measured raw performance (GFLOPS) of Conv2D, SGEMM, FFT3D, and SUmb, auto-tuned vs. hand-tuned, on Cell, a cluster, and a cluster of PS3s.] For FFT3D, performance is with fusion of leaf tasks; SUmb is too complicated to be hand-tuned.

60 Conclusion

61 The Road Ahead. 20x improvement in GPU performance by 2015; <2x improvement in CPU performance by 2015. Most of the value delivered to the end user comes from the GPU, the throughput computer. But not all things scale at the same rate: memory bandwidth scales by <5x, energy per bit-mm on chip stays nearly constant, and energy per op improves by only 5x. Each of these presents an interesting challenge.

62 Conclusion. Parallelism and locality are the keys to efficient computation, and denial architecture is at an end: we are entering an era of Throughput Computing on heterogeneous computers, where value comes from throughput applications running on throughput processors (like a GPU). Performance = parallelism; efficiency = locality; applications have lots of both. Stream processing: many ALUs exploit parallelism; a rich, exposed storage hierarchy enables locality; simple control and synchronization reduce overhead. Stream programming, with explicit movement and bulk operations, enables strategic optimization and exposes parallelism and locality. The result is performance and efficiency: TOPs on a chip at 20-30x the efficiency of conventional processors, with performance portability. GPUs are the ultimate stream processors.

63 Backup

64 Application Performance (cont.). [Stacked execution-time breakdown (0-100%) for DEPTH, MPEG, QRD, RTSL, and their average: host bandwidth stalls, stream controller overhead, memory stalls, cluster stalls, kernel non-main-loop time, kernel main-loop overhead, and operations.]


More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

Real-Time Support for GPU. GPU Management Heechul Yun

Real-Time Support for GPU. GPU Management Heechul Yun Real-Time Support for GPU GPU Management Heechul Yun 1 This Week Topic: Real-Time Support for General Purpose Graphic Processing Unit (GPGPU) Today Background Challenges Real-Time GPU Management Frameworks

More information

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD) Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

Stream Processors. Many signal processing applications require. Programmability with Efficiency

Stream Processors. Many signal processing applications require. Programmability with Efficiency WILLIAM J. DALLY, UJVAL J. KAPASI, BRUCEK KHAILANY, JUNG HO AHN, AND ABHISHEK DAS, STANFORD UNIVERSITY Many signal processing applications require both efficiency and programmability. Baseband signal processing

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Massively Parallel Architectures

Massively Parallel Architectures Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger

More information

The Implementation and Analysis of Important Symmetric Ciphers on Stream Processor

The Implementation and Analysis of Important Symmetric Ciphers on Stream Processor 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore The Implementation and Analysis of Important Symmetric Ciphers on Stream Processor

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Parallel Systems I The GPU architecture. Jan Lemeire

Parallel Systems I The GPU architecture. Jan Lemeire Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution

More information

Analysis and Performance Results of a Molecular Modeling Application on Merrimac

Analysis and Performance Results of a Molecular Modeling Application on Merrimac Analysis and Performance Results of a Molecular Modeling Application on Merrimac Mattan Erez Jung Ho Ahn Ankit Garg William J. Dally Eric Darve Department of Electrical Engineering Department of Mechanical

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis Stanford University David Patterson U.C. Berkeley http://csl.stanford.edu/~christos Motivation Ideal processor

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

The Art of Parallel Processing

The Art of Parallel Processing The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

Parallel Computing. November 20, W.Homberg

Parallel Computing. November 20, W.Homberg Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical

More information

CPU Pipelining Issues

CPU Pipelining Issues CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput

More information

Computer Architecture s Changing Definition

Computer Architecture s Changing Definition Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

Cache-oblivious Programming

Cache-oblivious Programming Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information