The University of Texas at Austin
Slide 1: EE382N: Principles of Computer Architecture, Parallelism and Locality, Fall 2009
Lecture 24: Stream Processors Wrap-up + Sony/Toshiba/IBM Cell Broadband Engine
Mattan Erez, The University of Texas at Austin
Slide 2: Outline

- Stream wrap-up: SRF design
- Motivation
- Cell architecture: GPP controller (PPE), compute PEs (SPEs), interconnect (EIB), memory and I/O
- Comparisons: Stream Processors
- Software (probably next time)

All Cell-related images and figures: Sony and IBM; Cell Broadband Engine (c) Sony Corp.
Slide 3: Heat-map (Area per FPU), 64-bit

[Figure: heat-map of area per 64-bit FPU over the number of FPUs per cluster (ILP) and the number of clusters (DLP), showing the area overheads of the intra-cluster switches, the instruction sequencer, and the inter-cluster switch]

- Many reasonable hardware options for 64-bit
Slide 4: Application Performance

[Figure: relative runtime of CONV2D, DGEMM, FFT3D, FEM, MD, and CDP across machine configurations (1, 16, 4), (2, 8, 4), (16, 1, 4), (1, 32, 2), (2, 16, 2), and (4, 8, 2), broken down into all_seq_busy, some_seq_busy_mem_busy, no_seq_busy_mem_busy, and some_seq_busy_mem_idle]

- Small performance differences for good streaming applications
Slide 5: SRF Decouples Execution from Memory

- Static latencies inside the clusters; unpredictable I/O latencies outside
- Bandwidth hierarchy: cluster switches (3,840 GB/s), SRF lanes (512 GB/s), inter-cluster and memory switches to cache banks (64 GB/s), DRAM banks and I/O pins (<64 GB/s)

[Figure: clusters and SRF lanes on one side of the inter-cluster and memory switches, cache banks and DRAM banks on the other, annotated with the bandwidths above]
Slide 6: SRF Sequential Access

- Single-ported memory; efficient wide access of 4 contiguous words
- Implemented using sub-arrays: reduced access time, reduced power
- Stream buffers match bandwidth to compute needs and time-multiplex the SRF port
- Wide single port time-multiplexed by stream buffers

[Figure: SRF bank 0 (256b port, sub-arrays 0-3 with local WL drivers) feeding compute cluster 0 (64b)]
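As a rough illustration of the mechanism on this slide (a sketch, not Merrimac code; all names are invented here): one wide SRF port reads 4 contiguous words per access, and small per-stream buffers time-multiplex that port while each compute cluster drains one word per cycle.

```c
/* Sketch of a time-multiplexed wide SRF port with stream buffers.
 * One wide access fetches 4 contiguous words; a cluster consumes only
 * one word per cycle, so one port can feed several streams. */
#include <assert.h>

#define WIDE_WORDS 4          /* words per wide SRF port access */

typedef struct {
    int data[WIDE_WORDS];     /* stream buffer, refilled 4 words at a time */
    int count;                /* words currently buffered */
    int next_addr;            /* next sequential SRF address for this stream */
} stream_buf;

/* One SRF port cycle: the single wide port serves whichever stream
 * buffer has run dry, refilling it with 4 contiguous words. */
static void srf_port_cycle(stream_buf *sb, int nstreams) {
    int victim = 0;
    for (int i = 1; i < nstreams; i++)
        if (sb[i].count < sb[victim].count) victim = i;
    if (sb[victim].count == 0) {
        for (int w = 0; w < WIDE_WORDS; w++)
            sb[victim].data[w] = sb[victim].next_addr + w; /* "read" the SRF */
        sb[victim].count = WIDE_WORDS;
        sb[victim].next_addr += WIDE_WORDS;
    }
}

/* A compute cluster consumes one word per cycle from its stream buffer. */
static int consume(stream_buf *sb) {
    assert(sb->count > 0);    /* buffers are sized so this never underflows */
    return sb->data[WIDE_WORDS - sb->count--];
}
```

Because each cluster needs only one word per cycle, the single wide port has slack to refill every stream buffer in turn, which is exactly the bandwidth-matching role the slide attributes to stream buffers.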
Slide 7: In-lane Indexing Almost Free

- Same single-ported memory as before: efficient wide access of 4 contiguous words, sub-arrays for reduced access time and power, stream buffers matching bandwidth to compute needs and time-multiplexing the SRF port
- Indexed SRF access at low extra cost: an 8:1 mux in the sub-arrays and a pre-decode and row decoder per sub-array, fed by address FIFOs

[Figure: SRF bank 0 with per-sub-array pre-decode and row decoders, address FIFOs, and the 8:1 mux feeding compute cluster 0]
Slide 8: Stream Processors and GPUs

Stream Processors:
- Bulk kernel computation; kernel uses a scalar ISA (VLIW + SIMD)
- Bulk memory operations; software latency hiding
- Stream memory system; HW-optimized local memory
- Locality opportunities: minimize off-chip transfers, with a capable memory system
- So far mostly load-time parameters

GPUs:
- Bulk kernel computation; kernel uses a scalar ISA (SIMD)
- Scalar memory operations; threads to hide latency and to fill the memory pipe
- Small shared memory
- Limited locality: rely on off-chip BW (needed for graphics)
- Dynamic workloads, mostly read-only
Slide 9: Conclusions

- Stream Processors offer extreme performance and efficiency; they rely on software to enable more efficient hardware
- Empower software through new interfaces: exposed locality hierarchy, exposed communication, bulk operations and hierarchical control
- Decouple the execution pipeline from unpredictable I/O
- Help software where it makes sense: aggressive memory system with SIMD alignment, multiple parallelism mechanisms (can skip short vectors), hardware assist for bulk operation dispatch
- Stream Processors offer programmable, efficient, high performance
Slide 10: Cell Broadband Engine
Slide 11: Outline

- Motivation
- Cell architecture: GPP controller (PPE), compute PEs (SPEs), interconnect (EIB), memory and I/O
- Comparisons: Stream Processors
- Software (probably next time)
Slide 12: Cell Motivation, Part I

- Performance-demanding applications have different characteristics: parallelism, locality, real time (games, graphics, multimedia)
- They require a redesign of HW and SW to provide efficient high performance: power, memory, and frequency walls
- Cell was designed specifically for these applications; requirements set by Sony and Toshiba, main design and architecture at IBM
Slide 13: Move to IBM Slides

- Rest of the motivation and architecture slides taken directly from talks by Peter Hofstee, IBM
- Separate PDF file combined from: Slides/01_Hofstee_Cell.pdf
Slide 14: Outline

- Motivation
- Cell architecture: GPP controller (PPE), compute PEs (SPEs), interconnect (EIB), memory and I/O
- Comparisons: Stream Processors
- Software (probably next time)
Slide 15: Hardware Efficiency, Greater Software Responsibility

- Hardware matches VLSI strengths: throughput-oriented design; parallelism, locality, and partitioning; hierarchical control to simplify instruction sequencing; minimalistic HW scheduling and allocation
- Software is given more explicit control: explicit hierarchical scheduling and latency hiding (schedule), explicit parallelism (parallelize), explicit locality management (localize)
- Must reduce HW waste, but no free lunch
Slide 16: Locality
Slide 17: Storage/Bandwidth Hierarchy is Key to Efficient High Performance

[Figure: the bandwidth hierarchy from slide 5 annotated with capacities: DRAM banks (<64 GB/s over I/O pins), cache banks (64 GB/s), SRF lanes (512 GB/s), and cluster switches feeding 64-bit MADDs in 16 clusters (3,840 GB/s); capacity labels include 2 GB, 512 KB, 1 MB, and ~61 KB]

- From "Unleashing the power: A programming example of large FFTs on Cell", given by Alex Chow at power.org on 6/9/2005
Slide 18: SRF/LS Comparison

- Both serve as a staging area for memory and capture locality as part of the storage hierarchy
- Both have a single time-multiplexed wide port shared by kernel access, DMA access, and instruction access
- SP uses word granularity vs. Cell's 4-word granularity
- SP's SRF has an efficient auto-increment access mode
- Cell uses one memory for both code and data. Why?
Slide 19: Parallelism
Slide 20: Three Types of Parallelism in Applications

- Instruction-level parallelism (ILP): multiple instructions from the same basic block (loop body) that can execute together; true ILP is usually quite limited (~5 to ~20 instructions)
- Task-level parallelism (TLP): separate high-level tasks (different code) that can run at the same time; true TLP is very limited (only a few concurrent tasks)
- Data-level parallelism (DLP): multiple iterations of a loop that can execute concurrently; DLP is plentiful in scientific applications
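The loop structure distinctions above can be made concrete in plain C (illustrative examples, not from the lecture):

```c
/* DLP: iterations are independent, so any subset can run concurrently --
 * on SIMD lanes, across clusters/SPEs, or in separate threads.  TLP
 * would be running two unrelated functions like these at the same time. */
void scale(float *y, const float *x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];        /* no dependence between iterations */
}

/* Limited ILP only: the loop-carried dependence on s serializes the
 * iterations; only the few independent operations inside one iteration
 * (address computation, the load, the add) can overlap. */
float reduce_sum(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];              /* iteration i needs the s from i-1 */
    return s;
}
```

The first loop has n-way DLP; the second has essentially none, which is why DLP-rich codes map so much better onto wide machines.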
Slide 21: Taking Advantage of ILP

- Multiple FUs (VLIW or superscalar): Cell has limited superscalar (not for FP); Merrimac has 4-wide VLIW FP ops
- Latency tolerance (pipeline parallelism): Cell has 7 FP instructions in flight; Merrimac expected to have ~24 FP
- Merrimac uses VLIW to avoid interlocks and bypass networks
- Cell also emphasizes static scheduling; not clear to what extent dynamic variations are allowed
Slide 22: Taking Advantage of TLP

- Multiple FUs (MIMD): Cell can run a different task (thread) on each SPE, plus asynchronous DMA on each SPE (the DMA must be controlled by the SPE kernel); Merrimac can run a kernel and DMA concurrently, with DMAs fully independent of the kernels
- Latency tolerance: concurrent execution of different kernels and their associated stream memory operations
Slide 23: Taking Advantage of DLP

- Multiple FUs:
  - SIMD: a very (most?) efficient way of utilizing parallelism; Cell has 4-wide SIMD, Merrimac 16-wide
  - MIMD: convert DLP to TLP and use MIMD for different tasks
  - VLIW: convert DLP to ILP and use VLIW (unrolling, SWP)
- Latency tolerance: overlap memory operations and kernel execution (SWP and unrolling); take advantage of pipeline parallelism in memory
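The "convert DLP to ILP" bullet can be sketched with a classic unrolled dot product (illustrative code, not from the lecture): the four accumulators are independent, so a VLIW or superscalar core can keep four multiply-adds in flight instead of serializing on a single accumulator.

```c
/* Converting DLP to ILP by unrolling with multiple accumulators. */
float dot_unrolled4(const float *a, const float *b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {   /* main loop, unrolled by 4 */
        s0 += a[i]     * b[i];          /* the four accumulators are   */
        s1 += a[i + 1] * b[i + 1];      /* independent, so these four  */
        s2 += a[i + 2] * b[i + 2];      /* multiply-adds can overlap   */
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                  /* epilogue for leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);       /* tree-shaped final reduction */
}
```

Note the tradeoff the VLIW slides discuss: this transformation grows the code (prologue/epilogue, replicated bodies), which is the "code expansion" cost mentioned later.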
Slide 24: Memory System
Slide 25: High Bandwidth, Asynchronous DMA

- Very high bandwidth memory system: needed to keep the FUs busy even with a storage hierarchy; Cell has ~2 words/cycle (25.6 GB/s), Merrimac is designed for 4 words/cycle
- Sophisticated DMA: stride (with records), gather/scatter (with records)
- Differences in granularity of DMA control: Merrimac treats DMA as stream-level operations, Cell treats DMA as kernel-level operations
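The two bulk access patterns named above ("stride with records" and "gather/scatter with records") can be emulated with plain loads; in a sketch like this, memcpy stands in for what would really be an asynchronous DMA engine, and a "record" is a group of contiguous words moved as a unit. The function names are illustrative, not Cell or Merrimac APIs.

```c
#include <string.h>

/* Strided gather of fixed-size records: record i starts at
 * base + i*stride_words and is record_words long. */
void gather_strided(float *dst, const float *base,
                    int record_words, int stride_words, int nrecords) {
    for (int i = 0; i < nrecords; i++)
        memcpy(dst + i * record_words,
               base + i * stride_words,
               record_words * sizeof(float));
}

/* Indexed gather (scatter/gather): record i starts at base + idx[i],
 * so arbitrary access patterns can be packed densely into dst. */
void gather_indexed(float *dst, const float *base, const int *idx,
                    int record_words, int nrecords) {
    for (int i = 0; i < nrecords; i++)
        memcpy(dst + i * record_words,
               base + idx[i],
               record_words * sizeof(float));
}
```

The stream-level vs. kernel-level distinction in the last bullet is about who runs these loops: a separate memory system (Merrimac) or the compute kernel itself (Cell).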
Slide 26: Design Choices Discussion
Slide 27: Is VLIW a Good Idea?

- Want to take advantage of all available ILP; Merrimac has a shallower pipeline than Cell
- Efficiency: allows exposed communication at the kernel level; no need for interlocks (or at least cheaper ones); no bypass network
- Manages a larger register space: a distributed/clustered register file allows relaxing SIMD restrictions with communication
- VLIW code expansion can be a problem, especially for MIMD; requires recompilation; interlocks allow correct (possibly inefficient) execution
Slide 28: Is MIMD a Good Idea?

- Probably yes for limited parallelism: true for PS3, probably not for scientific applications
- May help irregular applications: they work well on Merrimac with relaxed-SIMD, but there are overheads (padding, extra node copies, SIMD conditionals)
- Multiple sequencers and extra instruction storage: instruction storage might be saved by pipelining kernels, but that is not always applicable and might increase bandwidth
- Cell's sequencer seems to be ~12% of an SPE
Slide 29: Is Kernel-Level DMA Control a Good Idea?

- MIMD (almost) dictates kernel-level control
- Consumes issue slots
- Works well for fixed-rate streams; requires multi-threading for irregular stream access (stream-level DMA lets hardware handle synchronization more efficiently)
- May allow pointer chasing? The only real advantage is making the way-too-long memory latency loop a little shorter by localizing control
Slide 30: Outline

- Motivation
- Cell architecture: GPP controller (PPE), compute PEs (SPEs), interconnect (EIB), memory and I/O
- Comparisons: Merrimac
- Software
Slide 31: Parallelism in Cell

- PPE has SIMD VMX instructions and is 2-way SMT
- 8 SPEs; each SPE operates exclusively on 4-vectors
- Odd/even instruction pipes: odd for branching and memory, even for compute
- Pipelining for taking advantage of ILP
- Asynchronous DMAs: SWP of communication and computation; memory-level parallelism
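The "SWP of communication and computation" bullet is, in practice, the double-buffering idiom: while the SPE computes on one LS-resident buffer, the next tile is already in flight into the other. A sketch (memcpy stands in for an asynchronous DMA get plus its completion wait; none of these names are Cell SDK calls):

```c
#include <string.h>

#define TILE 64                 /* words per tile staged into the LS */

static float local_buf[2][TILE]; /* two LS-resident buffers */

/* Double-buffered streaming reduction: overlap the "DMA" of tile t+1
 * with the computation on tile t. */
float process_stream(const float *in, int ntiles) {
    float acc = 0.0f;
    int cur = 0;
    memcpy(local_buf[cur], in, sizeof(local_buf[cur])); /* prologue: fetch tile 0 */
    for (int t = 0; t < ntiles; t++) {
        int nxt = cur ^ 1;
        if (t + 1 < ntiles)     /* start the "DMA" of the next tile now */
            memcpy(local_buf[nxt], in + (t + 1) * TILE, sizeof(local_buf[nxt]));
        for (int i = 0; i < TILE; i++)   /* compute on the current tile */
            acc += local_buf[cur][i];
        cur = nxt;              /* swap buffers */
    }
    return acc;
}
```

On real hardware the transfer and the compute loop genuinely overlap, so memory latency is hidden as long as each tile's compute time exceeds its transfer time.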
Slide 32: Locality in Cell

- PPE has L1 and L2 caches; the L2 cache is 512 KB
- SPEs have lots of registers and a local store: 128 registers per SPU, 256 KB LS per SPU
Slide 33: Communication and Synchronization in Cell

- Element Interconnect Bus (EIB): carries all data between the memory system and the PPE/SPEs, and all SPE-to-SPE, SPE-to-PPE, and PPE-to-SPE traffic
- The PPE's L2 cache is coherent with memory; SPEs are not coherent (the LS is not a cache), though SPE DMA is coherent with the L2
- SPEs can DMA to/from one another: LS space is memory-mapped into the global namespace, and each SPE has local virtual memory translation to do this
- No automatic coherency, so explicit synchronization: through memory or mailboxes
Slide 34: Cell Software Challenges

- Separate code for PPE and SPEs; explicit synchronization
- SPEs can only access memory through DMAs; DMA is asynchronous, but the prep instructions are part of SPE code; SW is responsible for consistency and coherency
- SPEs must be programmed with SIMD
- Lots of pipeline challenges left up to the programmer/compiler: deep pipeline with no branch predictor; 2-wide scalar pipeline needs static scheduling
- LS shared by DMA, instruction fetch, and SIMD LD/ST; no memory protection on the LS (the stack can eat data or code)
- The most common programming system is just a low-level API and intrinsics (the Cell SDK); luckily, other options exist
Slide 35: Separate Code for PPE and SPEs

- The PPE should be doing top-level control and I/O only; trying to compute on the PPE is problematic (in-order 2-way SMT with a deep pipeline, busy enough just coordinating the SPEs)
- The SPEs are the parallel compute engines: in the PowerPC architecture family, but really a different ISA and requirements (vector only, 2-way superscalar)
- The two processor types really mean two different software systems
- Synchronization can be tricky to get efficient: PPE-to-SPE should use a mailbox, SPE-to-PPE should use an L2 cache location. Why?
- Spawning threads on SPEs is painfully slow
Slide 36: Memory System Challenges

- No arbitrary global LD/ST from an SPE; everything must go through DMA
- Strict DMA alignment and granularity restrictions: all DMAs must be aligned to 16 bytes and request at least 16 bytes; all DMAs should be aligned to 128 bytes and request at least 128 bytes of contiguous memory; failing to maintain alignment will cause a bus error!
- Banked DRAM structure: consecutive DMA requests should span 2 KB (8 banks)
- DMA commands are issued from within the SPE: strided data access; a chain of stride commands gives arbitrary gather; the chained list of commands is prepared by the SPE and stored in the LS
- Need to explicitly synchronize with DMA completion
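The alignment rules above reduce to simple address arithmetic: to transfer an arbitrary [addr, addr+len) region, round the start down and the end up to the required boundary and transfer the padded region instead. A sketch with illustrative helper names (not Cell SDK calls):

```c
#include <stdint.h>

#define DMA_MIN_ALIGN  16u    /* hard rule: 16-byte aligned, at least 16 bytes */
#define DMA_PREF_ALIGN 128u   /* performance rule: 128-byte aligned chunks */

/* Round down / up to a power-of-two boundary b. */
static uint64_t align_down(uint64_t a, uint64_t b) { return a & ~(b - 1); }
static uint64_t align_up(uint64_t a, uint64_t b)   { return (a + b - 1) & ~(b - 1); }

/* Compute the padded transfer covering [addr, addr+len) at the
 * preferred 128-byte alignment; the SPE then picks its data out of the
 * padded region in the LS. */
static void padded_dma(uint64_t addr, uint64_t len,
                       uint64_t *dma_addr, uint64_t *dma_len) {
    *dma_addr = align_down(addr, DMA_PREF_ALIGN);
    *dma_len  = align_up(addr + len, DMA_PREF_ALIGN) - *dma_addr;
}
```

This is also why data structures intended for SPE consumption are usually declared with 16- or 128-byte alignment in the first place, so no padding is needed.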
Slide 37: SPE Programming Challenges

- Vector only: scalars must be placed in the LSW of a vector, and some instructions take this into account; all casts and operations are aligned to 16 bytes (4 words); poor integer-float casting
- The LS is the only SRAM in an SPE: be careful with a growing stack
- Sometimes need to force instruction prefetch to avoid stalls: priority is given to DMA first, then LD/ST, and only then fetch
- No branch prediction (predict not-taken?); branch hint instructions can be inserted explicitly
- Some help from the XLC compiler (and an occasional bug)
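The "scalars live in one slot of a vector" rule can be emulated in portable C (a sketch under stated assumptions: the SPU has no separate scalar registers, so a scalar occupies one slot of a 128-bit register and scalar arithmetic is done with vector instructions; slot 0 stands in for the preferred slot here, whereas on real hardware the slot depends on the data type):

```c
/* Emulation of scalar-in-vector execution with a 4-word "register". */
typedef struct { int w[4]; } qword;   /* stands in for a 128-bit SPU register */

/* Replicate a scalar into all lanes (what a splat intrinsic would do). */
static qword splat(int s) {
    qword q = {{s, s, s, s}};
    return q;
}

/* A "scalar" add performed as a full-width vector add: all four lanes
 * compute, but only the preferred slot's result is meaningful. */
static int scalar_add(int a, int b) {
    qword va = splat(a), vb = splat(b), vc;
    for (int i = 0; i < 4; i++)
        vc.w[i] = va.w[i] + vb.w[i];
    return vc.w[0];                   /* read back the preferred slot */
}
```

This is why heavily scalar code runs poorly on an SPE: every scalar operation pays for a full 4-wide vector operation plus the shuffles needed to move data in and out of the preferred slot.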
Slide 38: Overall

- Lots to deal with: explicit communication and synchronization, explicit locality, explicit parallelism, explicit pipeline scheduling
- Need some help from programming tools
More informationComputer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationVLIW/EPIC: Statically Scheduled ILP
6.823, L21-1 VLIW/EPIC: Statically Scheduled ILP Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind
More informationCUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer
CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs
More informationComputer Architecture Area Fall 2009 PhD Qualifier Exam October 20 th 2008
Computer Architecture Area Fall 2009 PhD Qualifier Exam October 20 th 2008 This exam has nine (9) problems. You should submit your answers to six (6) of these nine problems. You should not submit answers
More informationSupercomputing, Tutorial S03 New Orleans, Nov 14, 2010
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationEvaluating the Portability of UPC to the Cell Broadband Engine
Evaluating the Portability of UPC to the Cell Broadband Engine Dipl. Inform. Ruben Niederhagen JSC Cell Meeting CHAIR FOR OPERATING SYSTEMS Outline Introduction UPC Cell UPC on Cell Mapping Compiler and
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationCOSC 6385 Computer Architecture - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity
More informationEE382 Processor Design. Concurrent Processors
EE382 Processor Design Winter 1998-99 Chapter 7 and Green Book Lectures Concurrent Processors, including SIMD and Vector Processors Slide 1 Concurrent Processors Vector processors SIMD and small clustered
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationCompilation for Heterogeneous Platforms
Compilation for Heterogeneous Platforms Grid in a Box and on a Chip Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/heterogeneous.pdf Senior Researchers Ken Kennedy John Mellor-Crummey
More informationMEMORY HIERARCHY DESIGN FOR STREAM COMPUTING
MEMORY HIERARCHY DESIGN FOR STREAM COMPUTING A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationBeyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy
EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery
More informationHigh-Performance Processors Design Choices
High-Performance Processors Design Choices Ramon Canal PD Fall 2013 1 High-Performance Processors Design Choices 1 Motivation 2 Multiprocessors 3 Multithreading 4 VLIW 2 Motivation Multiprocessors Outline
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationEmbedded Systems: Hardware Components (part I) Todor Stefanov
Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System
More informationGetting CPI under 1: Outline
CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More information