Revisiting Parallelism
- Kristina Miller
- 5 years ago
Transcription
1 Revisiting Parallelism
Sudhakar Yalamanchili, Georgia Institute of Technology

Where Are We Headed?
[Chart: processor performance (MIPS) over time, spanning three eras]
- Era of Pipelined Architecture (special-purpose HW)
- Era of Instruction Level Parallelism (speculative, OOO & superscalar)
- Era of Thread & Processor Level Parallelism (multi-threaded, multi-core)
Source: Shekhar Borkar, Intel Corp.
ECE 4100/6100 (2)
2 Beyond ILP
- Performance is limited by the serial fraction (the parallelizable part scales across 1, 2, 3, 4 CPUs; the serial part does not)
- Coarse-grain parallelism in the post-ILP era: thread, process, and data parallelism
- Learn from the lessons of the parallel processing community
- Revisit the classifications and architectural techniques

Flynn's Classification*
- Single instruction stream, single data stream (SISD): the conventional, word-sequential architecture, including pipelined computers
- Single instruction stream, multiple data stream (SIMD): the multiple-ALU architectures (e.g., array processors); Data Level Parallelism (DLP)
- Multiple instruction stream, single data stream (MISD): not very common
- Multiple instruction stream, multiple data stream (MIMD): the traditional multiprocessor system; Thread Level Parallelism (TLP)

*M.J. Flynn, "Very high speed computing systems," Proc. IEEE, vol. 54(12).
3 ILP Challenges
- As machine ILP capabilities increase (ILP width and depth), so do the challenges
- OOO execution cores: key data structure sizes increase (ROB, ILP window, etc.); dependency tracking logic grows quadratically
- VLIW/EPIC: hardware interlocks, ports, and recovery logic (speculation) grow quadratically
- Circuit complexity increases with the number of in-flight instructions
- The way forward: data parallelism

Example: Itanium 2
- Note the percentage of the die devoted to control
- And this is a statically scheduled processor!
4 Data Parallel Alternatives
- Single instruction stream, multiple data stream cores
- Co-processors exposed through the ISA
- Co-processors exposed as a distinct processor
- Vector processing: over 5 decades of development

The SIMD Model
- Single instruction stream broadcast to all processors; processors execute in lock step on local data
- Efficient use of silicon area: fewer resources are devoted to control
- Distributed memory model vs. shared memory model
- Distributed memory: each processor has local memory; a data routing network operates under centralized control; processor masking handles data-dependent operations
- Shared memory: access to memory modules goes through an alignment network
- Instruction classes: computation, routing, masking
5 Two Issues
- Conditional execution
- Data alignment

Vector Cores
Sudhakar Yalamanchili, Georgia Institute of Technology
6 Classes of Vector Processors
- Vector machines: register machines and memory-to-memory machines
- Memory-to-memory architectures have seen a resurgence on chip

VMIPS
- Load/store architecture
- Multiported registers
- Deeply pipelined functional units
- Separate scalar registers
7 Cray Family Architecture
- Stream oriented: recall data skewing and concurrent memory accesses!
- The first load/store ISA design: Cray 1 (1976)

Features of Vector Processors
- Significantly less dependency checking logic: checks are of the same order of complexity as scalar comparisons, but significantly fewer are needed
- Vector data sets allow hazard-free operation on deep pipelines
- Conciseness of representation leads to a low instruction issue rate
- Reduction in normal control hazards: one vector operation vs. a sequence of scalar operations
- Concurrency in operation, memory access, and address generation, often statically known
8 Some Examples

Basic Performance Concepts
- Consider the vector operation Z = A*X + Y
- Execution time: t_ex = t_startup + n * t_cycle
- Metrics: R_infinity (the rate on an infinitely long vector), R_half (the vector length needed to reach half of R_infinity), R_v (the vector length at which vector mode becomes faster than scalar mode)
9 Optimizations for Vector Machines
- Chaining:
      MULT.V V1, V2, V3
      ADD.V  V4, V1, V5
  Fine-grained forwarding of the elements of a vector; needs additional ports on the vector register file; effectively creates a deeper pipeline
- Conditional operations and vector masks
- Scatter/gather operations
- Vector lanes: each lane is coupled to a portion of the vector register file; lanes are transparent to the code, like caches in the family-of-machines concept

The IBM Cell Processor
Sudhakar Yalamanchili, Georgia Institute of Technology
10 Cell Overview
[Block diagram: one PPU, eight SPUs, memory interface controller (MIC), bus interface controller (BIC), and RRAC I/O, connected by the internal bus]
- IBM/Toshiba/Sony joint project: multi-year effort, 400 designers
- 234 million transistors, 4+ GHz
- 256 Gflops single precision (billions of floating point operations per second); 26 Gflops double precision
- Area: 221 mm^2; technology: 90 nm SOI

Cell Overview (cont.)
- One 64-bit PowerPC processor: 4+ GHz, dual issue, two threads, 512 KB of second-level cache
- Eight Synergistic Processor Elements (or Streaming Processor Elements): co-processors, each with a dedicated 256 KB of memory (not a cache)
- EIB data ring for internal communication: four 16-byte data rings supporting multiple transfers; 96 B/cycle peak bandwidth; over 100 outstanding requests
- Dual Rambus XDR memory controllers (on chip): 25.6 GB/sec of memory bandwidth
- 76.8 GB/s chip-to-chip bandwidth (to an off-chip GPU)
11 Cell Features
- Security: an SPE is dynamically reconfigurable as a secure co-processor
- Networking: SPEs might off-load networking overheads (TCP/IP)
- Virtualization: run multiple OSs at the same time; Linux is the primary development OS for Cell
- Broadband: the SPE is a RISC architecture with a SIMD organization and a local store; 128+ concurrent transactions to memory per processor

PPE Block Diagram
- The PPE handles operating system and control tasks
- 64-bit Power Architecture(TM) with VMX
- In-order, 2-way hardware multi-threading
- Coherent load/store with 32 KB I & D L1 caches and a 512 KB L2
12 PPE Pipeline
[Diagram: PPE pipeline]

SPE Organization and Pipeline
[Diagrams: IBM Cell SPE organization; IBM Cell SPE pipeline]
13 Cell Temperature Graph
- Power and heat are key constraints
- Cell dissipates ~80 watts at 4+ GHz
- Cell has 10 temperature sensors
Source: IEEE ISSCC, 2005

SPE
- User-mode architecture: no translation/protection within the SPU; DMA uses the full Power Architecture protection/translation
- Direct programmer control: DMA/DMA-list, branch hints
- VMX-like SIMD dataflow: broad set of operations, graphics SP float, IEEE DP float (BlueGene-like)
- Unified register file: 128 entries x 128 bits
- 256 KB local store: combined I & D; 16 B/cycle load/store bandwidth; 128 B/cycle DMA bandwidth
14 Cell I/O
- XDR is new high-speed memory from Rambus
- Dual XDR(TM) controllers (3.2 Gbps per pin)
- Two configurable interfaces; flexible bandwidth between the interfaces; allows for multiple system configurations
- Pros: fast (the dual controllers give 25.6 GB/s; the contemporary AMD Opteron manages only 6.4 GB/s); small pin count, so only a few chips are needed for high bandwidth
- Cons: expensive (high cost per bit)

Multiple system support
- Game console systems
- Workstations (CPBW)
- HDTV
- Home media servers
- Supercomputers
15 Programming Cell
- 10 virtual processors: 2 threads on the PowerPC plus 8 co-processor SPEs
- Communicating with SPEs: the 256 KB local storage is NOT a cache; data must be explicitly moved in and out of the local store using the DMA engine (supports scatter/gather)

Programming Cell (cont.)
- Multiple-ISA hand-tuned programs, explicit SIMD coding, SIMD alignment directives: highest performance, using the local memories with help from programmers
- Shared memory, single program abstraction, automatic tuning for each ISA, automatic SIMDization, automatic or explicit parallelization: highest productivity, with fully automatic compiler technology
16 Execution Model
- SPE executables are embedded as read-only data in the PPE executable
- Use the memory flow controller (MFC) for DMA operations
- The "shopping list" view of memory accesses
Source: IBM

Programming Model
SPE program (spe_foo.c, a C program compiled into an executable called "spe_foo"):

    int main(unsigned long long speid, addr64 argp, addr64 envp)
    {
        int i;
        /* func_foo would be the real code */
        i = func_foo(argp);
        return i;
    }

PPE program (spe_runner.c, a C program linked with spe_foo and run on the PPE):

    extern spe_program_handle_t spe_foo;

    int main()
    {
        int rc, status = 0;
        speid_t spe_id;
        spe_id = spe_create_thread(0, &spe_foo, 0, NULL, -1, 0);
        rc = spe_wait(spe_id, &status, 0);   /* blocking call */
        return status;
    }

Source: IBM
17 SPE Programming
- Dual issue, with issue constraints
- Predication and branch hints; no branch prediction hardware
- Alignment instructions
Source: IBM

Programming Idioms: Pipeline
[Diagram: pipelined decomposition of work across SPEs]
18 Programming Idioms: Work Queue Model
- SPEs pull data off of a shared work queue
- Self scheduled

SPMD & MIMD Accelerators
- Accelerators executing the same (SPMD) or different (MPMD) programs
19 Cell Processor Application Areas
- Digital content creation (games and movies)
- Game playing and game serving
- Distribution of (dynamic, media-rich) content
- Imaging and image processing
- Image analysis (e.g., video surveillance)
- Next-generation physics-based visualization
- Video conferencing (3D)
- Streaming applications (codecs, etc.)
- Physical simulation & science
20 IRAM Cores
Sudhakar Yalamanchili, Georgia Institute of Technology

Data Parallelism and the Processor-Memory Gap
[Chart: performance vs. time; µproc performance grows 60%/yr (Moore's Law), DRAM performance grows 7%/yr, so the processor-memory performance gap grows 50%/yr]
- How can we close this gap?
21 The Effects of the Processor-Memory Gap
- Tolerating the gap with deeper cache hierarchies increases worst-case access time
- System-level impact (Alpha): I & D cache access: 2 clocks; L2 cache: 6 clocks; L3 cache: 8 clocks; memory: 76 clocks (DRAM component access: 18 clocks)
- How much time is spent in the memory hierarchy? SpecInt92: 22%; Specfp92: 32%; database: 77%; sparse matrix: 73%

Where do the Transistors Go?
  Processor         % Area (~cost)   % Transistors (~power)
  Alpha                              77%
  StrongArm SA110   61%              94%
  Pentium Pro       64%              88%
- Caches have no inherent value; they simply recover bandwidth?
22 Impact of DRAM Capacity
- Increasing capacity creates a quandary: the continual four-fold increase in density increases the minimum memory increment for a given width
- How do we match the memory bus width?
- Cost/bit issues for wider DRAM chips: die size, testing, package costs
- The number of DRAM chips decreases, and with it the available concurrency

Merge Logic and DRAM!
- Bring the processors to the memory
- Tremendous on-chip bandwidth for predictable application reference patterns
- Enough memory to hold complete programs and data becomes feasible
- More applications are limited by memory speed
- Better memory latency for applications with irregular access patterns
- Synchronous DRAMs are compatible with integration alongside higher-speed logic
23 Potential: IRAM for Lower Latency
- DRAM latency: the dominant delay is the RC of the word lines
- Keep wire lengths short & block sizes small
- RAS/CAS latency for 64b-256b IRAM accesses?

Potential for IRAM Bandwidth
- Mbit modules (1 Gb total), each 256 b wide, with a 20 ns RAS/CAS = 320 GBytes/sec
- Even if a crossbar switch delivers only 1/3 to 2/3 of the bandwidth of 20% of the modules, the result is still measured in GBytes/sec
- FYI: AlphaServer 8400 = 1.2 GBytes/sec (75 MHz, 256-bit memory bus, 4 banks)
24 IRAM Applications
- PDAs, cameras, gameboys, cell phones, pagers
- Database systems?
[Chart: database demand grows 2X / 9 months (Greg's Law), µproc speed 2X / 18 months (Moore's Law), DRAM speed 2X / 120 months; both the database-processor and processor-memory performance gaps keep widening]

Estimating IRAM Performance
- Direct application produces modest performance improvements
- Existing architectures were designed to overcome the memory bottleneck, not to use tremendous memory bandwidth
- Need to rethink the design: tailor the architecture to utilize the high bandwidth
25 Emerging Embedded Applications and Characteristics
- The fastest growing application domain: video processing, speech recognition, 3D graphics; set-top boxes, game consoles, PDAs
- Data parallel, with typically low temporal locality
- Size, weight, and power constraints: the highest-speed processor is not necessarily the best processor
- What about the role of ILP processors here?
- Real-time constraints: the right data at the right time

SIMD/Vector Architectures
- VIRAM (Vector IRAM)
- Logic is slow in a DRAM process
- So put a vector unit in a DRAM and provide a port between a traditional processor and the vector IRAM, instead of putting a whole processor in DRAM
Source: Berkeley Vector IRAM
26 ISA
- A load/store vector ISA defined as a co-processor to the MIPS-64 ISA
- Vector register file with 32 entries; each register can be configured to hold 64b, 32b, or 16b elements, integer or FP
- Two scalar register files: one for memory and exception handling (base addresses and stride information), one for scalar operands
- Flag registers
- Special limited-scope instructions to permute the contents of vector registers
- Integer instructions for saturated arithmetic

MIMD Machines
[Diagram: processor + cache (P+C) nodes, each with a directory (Dir) and memory, joined by an interconnection network]
- Parallel processing has catalyzed the development of several generations of parallel processing machines
- Unique features include the interconnection network, support for system-wide synchronization, and programming languages/compilers
27 Basic Models for Parallel Programs
- Shared memory: coherency/consistency are the driving concerns; the programming model is simplified at the expense of system complexity
- Message passing: typically implemented on distributed memory machines; system complexity is simplified at the expense of increased effort by the programmer

Shared Memory vs. Message Passing
- Shared memory simplifies software development but increases hardware complexity and power: directories, coherency enforcement logic, and more recently transactional memory
- Message passing doesn't need a centralized bus: it simplifies the hardware and gives scalable memory and interconnect bandwidth, but increases the complexity of software development and the burden on the developer
28 Two Emerging Challenges
- Programming models and compilers? (Source: Intel Corp.)
- Interconnection networks (Source: IBM)
More informationNext Generation Technology from Intel Intel Pentium 4 Processor
Next Generation Technology from Intel Intel Pentium 4 Processor 1 The Intel Pentium 4 Processor Platform Intel s highest performance processor for desktop PCs Targeted at consumer enthusiasts and business
More informationOptimizing Data Sharing and Address Translation for the Cell BE Heterogeneous CMP
Optimizing Data Sharing and Address Translation for the Cell BE Heterogeneous CMP Michael Gschwind IBM T.J. Watson Research Center Cell Design Goals Provide the platform for the future of computing 10
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationHow to Write Fast Code , spring th Lecture, Mar. 31 st
How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCray XE6 Performance Workshop
Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed
More informationComputer System Components
Computer System Components CPU Core 1 GHz - 3.2 GHz 4-way Superscaler RISC or RISC-core (x86): Deep Instruction Pipelines Dynamic scheduling Multiple FP, integer FUs Dynamic branch prediction Hardware
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationVector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks
Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis Stanford University David Patterson U.C. Berkeley http://csl.stanford.edu/~christos Motivation Ideal processor
More informationComputer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationIntroduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationParallel Architecture. Hwansoo Han
Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2018 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationanced computer architecture CONTENTS AND THE TASK OF THE COMPUTER DESIGNER The Task of the Computer Designer
Contents advanced anced computer architecture i FOR m.tech (jntu - hyderabad & kakinada) i year i semester (COMMON TO ECE, DECE, DECS, VLSI & EMBEDDED SYSTEMS) CONTENTS UNIT - I [CH. H. - 1] ] [FUNDAMENTALS
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy The course has 4 modules Module 1 Instruction Set Architecture (ISA) Simple Pipelining and Hazards Module 2 Superscalar Architectures
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationMain Memory. EECC551 - Shaaban. Memory latency: Affects cache miss penalty. Measured by:
Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row (~every 8 msec). Static RAM may be
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More information