Supercomputer Architecture


1 Basics of Supercomputing: Supercomputer Architecture
Prof. Thomas Sterling, Pervasive Technology Institute, School of Informatics & Computing, Indiana University
Dr. Steven R. Brandt, Center for Computation & Technology, Louisiana State University

2 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

3 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

4 New Fastest Computer in the World (Department of Computer Science, Louisiana State University)

5 Supercomputing System Stack
- Device technologies: enabling technologies for logic, memory, & communication
- Circuit design
- Computer architecture: semantics and structures
- Models of computation: governing principles
- Operating systems: manage resources and provide a virtual machine
- Compilers and runtime software: map the application program to system resources, mechanisms, and semantics
- Programming languages, tools, & environments
- Algorithms: numerical techniques; means of exposing parallelism
- Applications: end-user problems, often in science and technology

6 Classes of Architecture for High Performance Computers
- Parallel Vector Processors (PVP): NEC Earth Simulator, SX-6; Cray-1, 2, XMP, YMP, C90, T90, X1; Fujitsu VPP5000 series
- Massively Parallel Processors (MPP): Intel Touchstone Delta & Paragon; TMC CM-5; IBM SP-2 & 3, Blue Gene/L; Cray T3D, T3E, Red Storm/Strider
- Distributed Shared Memory (DSM): SGI Origin; HP Superdome
- Single Instruction stream, Multiple Data stream (SIMD): Goodyear MPP, MasPar 1 & 2, TMC CM-2
- Commodity Clusters: Beowulf-class PC/Linux clusters
- Constellations: HP Compaq SC, Linux NetworX MCR

7 Where Does Performance Come From?
- Device technology: logic switching speed and device density; memory capacity and access time; communications bandwidth and latency
- Computer architecture: instruction issue rate; execution pipelining; reservation stations; branch prediction; cache management
- Parallelism: number of operations per cycle per processor (instruction-level parallelism (ILP), vector processing); number of processors per node; number of nodes in a system

8 Top 500: System Architecture (chart)

9 Driving Issues/Trends
- Multicore: now 8 cores (AMD Opteron, Intel Xeon), possibly 100s soon; will be million-way parallelism
- Heterogeneity: GPGPU; ClearSpeed; Cell SPE
- Component I/O pins: off-chip bandwidth is not increasing with demand; limited number of pins; limited bandwidth per pin (pair); cache size per core may decline; shared-cache fragmentation
- System interconnect: node bandwidth is not increasing proportionally to core demand
- Power: megawatts at the high end = millions of dollars per year (see the sketch below)
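A back-of-the-envelope check on that last line; the $0.10/kWh electricity rate is an assumption for illustration, not a figure from the slides:

```c
#include <stdio.h>

/* Rough annual electricity cost of a machine drawing a constant load.
   The $0.10/kWh utility rate is an assumed illustrative value. */
int main(void) {
    double megawatts = 1.0;               /* sustained draw at the high end */
    double dollars_per_kwh = 0.10;        /* assumed rate */
    double hours_per_year = 24.0 * 365.0; /* 8760 hours */
    double cost = megawatts * 1000.0 * hours_per_year * dollars_per_kwh;
    printf("1 MW for a year: $%.0f\n", cost); /* ~$876,000 per MW-year */
    return 0;
}
```

At that rate every sustained megawatt costs on the order of a million dollars per year, which is the scale the slide refers to.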

10 Multi-Core
Motivation for multi-core:
- Exploits improved feature size and density
- Increases functional units per chip (spatial efficiency)
- Limits energy consumption per operation
- Constrains growth in processor complexity
Challenges resulting from multi-core:
- Relies on effective exploitation of multiple-thread parallelism: need for a parallel computing model and a parallel programming model
- Aggravates the memory wall: memory bandwidth (a way to get data out of the memory banks and into the multi-core processor array) and memory latency
- Fragments the (shared) L3 cache
- Pins become a strangle point: the rate of pin growth is projected to slow and flatten; bandwidth per pin (pair) is projected to grow slowly
- Requires mechanisms for efficient inter-processor coordination: synchronization; mutual exclusion; context switching (see the sketch below)
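A minimal sketch of the last point, mutual exclusion between threads, using POSIX threads; the thread count and iteration count are arbitrary illustrative values:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments a shared counter; the mutex enforces
   mutual exclusion so the updates do not race. */
static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter); /* NTHREADS * NITER */
    return 0;
}
```

Every increment pays for a lock acquisition here; keeping such coordination off hot paths is exactly the multi-core coordination cost the slide warns about.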

11 Heterogeneous Multicore Architecture
- Combines different types of processors, each optimized for a different operational modality
- The synthesis favors superior performance for complex computations exhibiting distinct modalities, better than any one of the N constituent processor types alone
- Conventional co-processors: graphics processing units (GPUs); network interface controllers (NICs); efforts are underway to apply existing special-purpose components to general applications
- Purpose-designed accelerators, integrated to significantly speed up some critical aspect of one or more important classes of computation: IBM Cell architecture; ClearSpeed SIMD attached array processor

12 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

13 Major Technology Generations (dates approximate)
- Electromechanical: 19th century through the first half of the 20th century
- Digital electronics with vacuum tubes: 1940s
- Transistors: 1947
- Core memory: 1950
- SSI & MSI RTL/DTL/TTL semiconductor logic: 1970
- DRAM: 1970s
- CMOS VLSI: 1990
- Multicore

14 Moore's Law
Moore's Law describes a long-term trend in the history of computing hardware: the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years.
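A quick numerical sketch of what two-year doubling implies; the Intel 4004 baseline (~2300 transistors in 1971) is used only as an illustrative starting point, not something taken from the slide:

```c
#include <stdio.h>
#include <math.h>

/* Transistor count under a strict two-year doubling, starting from the
   Intel 4004 (~2300 transistors, 1971) as an illustrative baseline. */
int main(void) {
    double base = 2300.0;
    for (int year = 1971; year <= 2011; year += 10) {
        double count = base * pow(2.0, (year - 1971) / 2.0);
        printf("%d: ~%.0f transistors\n", year, count);
    }
    return 0;
}
```

Forty years of doubling every two years is a factor of 2^20, about a million-fold, consistent with the march from thousands to billions of transistors per chip.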

15 The SIA ITRS Roadmap
(Chart: DRAM capacity in MB per chip, logic transistors per chip, and microprocessor clock rate in MHz, on a log scale, plotted against year of technology introduction.)

16 Impact of VLSI
- Mass-produced microprocessors enabled low-cost computing: PCs and workstations; economy of scale
- Ensembles of multiple processors: the microprocessor becomes the building block of parallel computers
- Favors sequential, process-oriented computing: a natural hardware-supported execution model; requires locality management (data, control)
- I/O channels (south bridge) provide the external interface: coarse-grained communication packets
- Suggests concurrent execution at the process-boundary level: processes statically assigned to processors (one-to-one); operating on local data; coordination by large value-oriented I/O messages for inter-process/processor synchronization and remote data exchange

17 Classical DRAM
(Charts: Gbits per chip over time, historical vs. SIA production vs. SIA introduction; die layout showing memory mats of ~1 Mbit each, row decoders, primary sense amps, secondary sense amps & page multiplexing, timing/BIST/interface, kerf, and percent chip overhead.)
Key points: density per chip has dropped below 4X per 3 years, and 45% of the die is non-memory.

18 Microprocessors no longer realize the full potential of VLSI technology
(Chart: performance in ps/instruction versus year on a log scale from 1e-3 to 1e+7, compared against the linear trend; the gap grows from 30:1 to 1,000:1 to 30,000:1. Courtesy of Bill Dally, Nvidia.)

19 The Memory Wall
(Chart: memory access time versus CPU cycle time over the years; the ever-growing ratio between them is "THE WALL".)

20 Performance Recap: Who Cares About the Memory Hierarchy?
Processor-DRAM memory gap (latency):
- CPU ("Moore's Law"): ~60%/yr (2X/1.5 yr)
- DRAM: ~9%/yr (2X/10 yrs)
- Processor-memory performance gap grows ~50%/year
(Chart: relative performance versus time on a log scale, 1 to 1000. Copyright 2001, UCB, David Patterson.)
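The 50%/year figure is just the ratio of the two growth rates; a small sketch:

```c
#include <stdio.h>

/* If CPU performance grows 60%/yr and DRAM performance 9%/yr,
   the processor-memory gap grows by their ratio each year. */
int main(void) {
    double cpu = 1.0, dram = 1.0;
    for (int year = 1; year <= 10; year++) {
        cpu  *= 1.60;
        dram *= 1.09;
        printf("year %2d: gap = %6.1fx\n", year, cpu / dram);
    }
    /* 1.60 / 1.09 = 1.468..., i.e. the gap widens ~47% (about 50%) per year */
    return 0;
}
```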

21 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

22 Shared Memory Multiple Thread
- Static or dynamic; fine grained
- OpenMP (see the sketch below)
- Distributed shared memory systems (covered on day 3)
(Diagrams: a Symmetric Multi-Processor (SMP, usually cache coherent), e.g. Orion at JPL NASA, with CPUs reaching shared memory banks through a common network; and Distributed Shared Memory (DSM, often not cache coherent), with each CPU paired with its own memory and joined by a network.)
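A minimal OpenMP sketch of this shared-memory, multiple-thread style (compile with -fopenmp; the array size is an arbitrary illustrative choice):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* All threads share the arrays; OpenMP splits the loop iterations
   among them, each thread touching a disjoint slice. */
int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = (double)i;

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[N-1] = %f (ran with up to %d threads)\n",
           a[N - 1], omp_get_max_threads());
    return 0;
}
```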

23 SMP Context
- A standalone system incorporating everything needed for computation: processors, memory, external I/O channels, local disk storage, user interface
- Enterprise server and institutional computing market
- Exploits economy of scale for enhanced performance-to-cost
- Substantial performance; target for ISVs (Independent Software Vendors)
- Shared-memory multiple-thread programming platform: easier to program than distributed-memory machines; enough parallelism to fully employ system threads (processor cores)
- Building block for ensemble supercomputers: commodity clusters, MPPs

24 Major Elements of an SMP Node
- Processor chip
- DRAM main memory cards
- Motherboard chipset: on-board memory network (north bridge); on-board I/O network (south bridge)
- Industry-standard interfaces: PCI, PCI-X, PCI Express
- System area network controllers, e.g. Ethernet, Myrinet, InfiniBand, Quadrics, Federation switch
- System management network: usually Ethernet; JTAG for low-level maintenance
- Internal disk and disk controller
- Peripheral interfaces

25 SMP Node Diagram
(Diagram: microprocessors (MP), each with L1 and L2 caches and sharing an L3, connect to memory banks M1..Mn-1 and, through controllers and NICs, to storage (S), PCI-e, JTAG, Ethernet, USB, and other peripherals.)
Legend: MP = microprocessor; L1, L2, L3 = caches; M1.. = memory banks; S = storage; NIC = network interface card.

26 Performance Issues for SMP Nodes
- Cache behavior: hit/miss rate; replacement strategies; prefetching
- Clock rate
- ILP
- Branch prediction
- Memory: access time; bandwidth; bank conflicts

27 Sample SMP Systems: Dell PowerEdge; HP ProLiant; Intel Server System; Microway Quadputer; IBM p

28 HyperTransport-based SMP System (diagram)

29 IBM POWER7 Processor and Core (images of the P7 processor and P7 core)

30 Processor Core Micro-Architecture
- Execution pipeline: stages of functionality to process issued instructions; hazards are conflicts that stall continued execution; forwarding supports closely associated operations exhibiting precedence constraints
- Out-of-order execution: uses reservation stations; hides some core latencies and provides fine-grained asynchronous operation supporting concurrency
- Branch prediction: permits computation to proceed past a conditional branch point before the predicate value is resolved; overlaps follow-on computation with predicate resolution; requires roll-back or equivalent to correct wrong guesses; sometimes follows both paths, several branches deep (see the sketch below)
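To see why good prediction matters, a hedged measurement sketch (array size, threshold, and the use of clock() are illustrative choices): the same loop is timed once with random branch outcomes and once with predictable ones, after sorting.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

/* Sum elements above a threshold; the if() is the branch of interest. */
static long long sum_above(const int *v, int n, int threshold) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        if (v[i] > threshold)   /* predictable if v is sorted, random otherwise */
            s += v[i];
    return s;
}

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    for (int i = 0; i < N; i++) v[i] = rand() % 256;

    clock_t t0 = clock();
    long long s1 = sum_above(v, N, 128);   /* random branch outcomes */
    clock_t t1 = clock();
    qsort(v, N, sizeof *v, cmp);           /* make outcomes predictable */
    clock_t t2 = clock();
    long long s2 = sum_above(v, N, 128);
    clock_t t3 = clock();

    printf("unsorted: %lld (%.2fs)  sorted: %lld (%.2fs)\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(v);
    return 0;
}
```

On typical hardware the sorted pass runs severalfold faster, because each misprediction in the unsorted pass forces the roll-back the slide describes.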

31 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

32 What is a cache?
- Small, fast storage used to improve the average access time to slow memory; exploits spatial and temporal locality
- In computer architecture, almost everything is a cache! Registers are a cache on variables; the first-level cache is a cache on the second-level cache; the second-level cache is a cache on memory; memory is a cache on disk (virtual memory); the TLB is a cache on the page table; branch prediction is a cache on prediction information
(Diagram: processor/registers at the top, then L1 cache, L2 cache, memory, and disk/tape; levels grow bigger toward the bottom and faster toward the top. Copyright 2001, UCB, David Patterson.)

33 Multicore Microprocessor Component Elements
- Multiple processor cores (one or more per chip)
- L1 caches: instruction cache; data cache
- L2 cache: joint instruction/data cache; dedicated to an individual core
- L3 cache (not all systems): shared among multiple cores; often off die but in the same package
- Memory interface: address translation and management; (sometimes) north bridge
- I/O interface: south bridge

34 Levels of the Memory Hierarchy (Copyright 2001, UCB, David Patterson)

Level         | Capacity          | Access time             | Cost        | Staging transfer unit       | Managed by
CPU registers | 100s of bytes     | < 0.5 ns (~1 CPU cycle) |             | instr. operands, 1-8 bytes  | program/compiler
L1 cache      | 10s-100s of KB    | 1-5 ns                  | $10/MB      | blocks, 8-128 bytes         | cache controller
Main memory   | a few GB          | 50-150 ns               | $0.02/MB    | pages, 512-4K bytes         | OS
Disk          | 100s-1000s of GB  | ~10 ms                  | $0.25/GB    | files, MBytes               | user/operator
Tape          | "infinite"        | sec-min                 | $0.0014/MB  |                             | user/operator

Upper levels are smaller and faster; lower levels are larger and cheaper per byte.

35 Performance: Locality
- Temporal locality: if a program accesses a memory location, there is a much higher than random probability that the same location will be accessed again.
- Spatial locality: if a program accesses a memory location, there is a much higher than random probability that nearby locations will be accessed soon. Spatial locality is usually easier to achieve than temporal locality (see the sketch below).
- A couple of key factors affect the relationship between locality and scheduling: the size of the dataset being processed by each processor, and how much reuse is present in the code processing a chunk of iterations.
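A standard way to observe spatial locality, sketched below with an arbitrary matrix size: summing a matrix in row-major order walks memory with unit stride, while the column-major version strides by a full row and wastes most of each cache line it touches.

```c
#include <stdio.h>
#include <time.h>

#define N 4096

static double a[N][N];   /* 128 MB, zero-initialized */

/* C stores a[i][j] row by row, so the i-then-j loop walks memory
   sequentially (good spatial locality); swapping the loops strides
   by N doubles per access and misses in cache far more often. */
int main(void) {
    double sum = 0.0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];            /* unit stride */
    clock_t t1 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];            /* stride of N doubles */
    clock_t t2 = clock();
    printf("row-major %.2fs, column-major %.2fs (sum %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    return 0;
}
```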

36 Memory Hierarchy: Terminology
- Hit: data appears in some block in the upper level (example: block X)
  - Hit rate: the fraction of memory accesses found in the upper level
  - Hit time: time to access the upper level = RAM access time + time to determine hit/miss
- Miss: data must be retrieved from a block in the lower level (block Y)
  - Miss rate = 1 - hit rate
  - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit time << miss penalty (500 instructions on the 21264!)
(Diagram: blocks X and Y moving between the processor, upper-level memory, and lower-level memory. Copyright 2001, UCB, David Patterson.)
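These definitions combine into the usual average memory access time (AMAT) relation; the numbers below are illustrative assumptions, not figures from the slide:

$$AMAT = T_{hit} + r_{miss} \times T_{miss\ penalty}$$
$$\text{e.g.}\quad 1\,\text{ns} + 0.05 \times 100\,\text{ns} = 6\,\text{ns}$$

Even a 5% miss rate makes the average access six times slower than a hit, which is why miss rates dominate memory-hierarchy tuning.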

37 Cache Performance

$$T = I_{count} \cdot CPI \cdot T_{cycle}$$
$$I_{count} = I_{ALU} + I_{MEM}$$
$$CPI = \frac{I_{ALU}}{I_{count}}\,CPI_{ALU} + \frac{I_{MEM}}{I_{count}}\,CPI_{MEM}$$
$$T = \Big[ M_{ALU}\,CPI_{ALU} + M_{MEM}\big(CPI_{MEM\text{-}HIT} + r_{MISS}\,CPI_{MEM\text{-}MISS}\big) \Big] \cdot I_{count} \cdot T_{cycle}$$

where
T = total execution time
T_cycle = time for a single processor cycle
I_count = total number of instructions
I_ALU = number of ALU instructions (e.g., register-register)
I_MEM = number of memory-access instructions (e.g., load, store)
CPI = average cycles per instruction
CPI_ALU = average cycles per ALU instruction
CPI_MEM = average cycles per memory instruction
r_miss = cache miss rate; r_hit = cache hit rate
CPI_MEM-MISS = cycles per cache miss
CPI_MEM-HIT = cycles per cache hit
M_ALU = I_ALU / I_count (instruction mix for ALU instructions)
M_MEM = I_MEM / I_count (instruction mix for memory-access instructions)
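The model drops straight into code; the inputs below (instruction count, mix, cycle time, hit/miss cycles) are assumed values for illustration:

```c
#include <stdio.h>

/* Execution-time model from the slide:
   T = [M_ALU*CPI_ALU + M_MEM*(CPI_hit + r_miss*CPI_miss)] * I_count * T_cycle */
static double exec_time(double i_count, double m_mem,
                        double cpi_alu, double cpi_hit,
                        double cpi_miss, double r_miss,
                        double t_cycle) {
    double m_alu = 1.0 - m_mem;   /* instruction mixes sum to 1 */
    double cpi = m_alu * cpi_alu
               + m_mem * (cpi_hit + r_miss * cpi_miss);
    return cpi * i_count * t_cycle;
}

int main(void) {
    /* Illustrative inputs: 10^11 instructions, 30% memory ops, 1 ns cycle,
       1-cycle ALU ops and hits, 100-cycle miss penalty. */
    double i_count = 1e11, t_cycle = 1e-9;
    printf("r_miss=0.01: %.1f s\n",
           exec_time(i_count, 0.3, 1.0, 1.0, 100.0, 0.01, t_cycle));
    printf("r_miss=0.10: %.1f s\n",
           exec_time(i_count, 0.3, 1.0, 1.0, 100.0, 0.10, t_cycle));
    return 0;
}
```

With these assumptions, raising the miss rate from 1% to 10% roughly triples the execution time, the same qualitative story as the example on the next slide.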

38 Cache Performance: Example
Applying the model with CPI_MEM = CPI_MEM-HIT + r_MISS * CPI_MEM-MISS to two machines A and B that differ only in cache hit rate (r_hitA vs. r_hitB) gives total execution times T_A = 150 sec and T_B = 550 sec: a higher miss rate alone slows the same workload several-fold.

39 Performance Shared Memory (OpenMP): Key Factors
- Load balancing: mapping workloads with thread scheduling
- Caches: write-through vs. write-back
- Locality: temporal locality; spatial locality; how locality affects the choice of scheduling algorithm
- Synchronization: the effect of critical sections on performance (see the sketch below)
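A short OpenMP sketch touching two of these factors, scheduling and synchronization; the chunk size and loop bounds are arbitrary illustrative choices:

```c
#include <omp.h>
#include <stdio.h>

#define N 100000

/* Per-iteration cost varies with i, so static scheduling can leave
   threads unevenly loaded; dynamic scheduling rebalances per chunk. */
int main(void) {
    double total = 0.0;

    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int i = 0; i < N; i++) {
        double t = 0.0;
        for (int j = 0; j < i % 1000; j++)  /* irregular work */
            t += 1.0;
        total += t;      /* reduction: no critical section on the hot path */
    }

    /* A critical section serializes threads; fine for rare updates,
       a bottleneck if placed inside a hot loop. */
    int max_threads = 0;
    #pragma omp parallel
    {
        #pragma omp critical
        {
            if (omp_get_thread_num() + 1 > max_threads)
                max_threads = omp_get_thread_num() + 1;
        }
    }
    printf("total=%.0f, saw %d threads\n", total, max_threads);
    return 0;
}
```

Using a reduction instead of a critical section in the inner loop is the usual way to keep synchronization cost off the hot path.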

40 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

41 What is a Commodity Cluster?
- A distributed/parallel computing system constructed entirely from commodity subsystems: all subcomponents can be acquired commercially and separately
- Computing elements (nodes) are employed as fully operational, standalone, mainstream systems
- Two major subsystems: compute nodes and the system area network (SAN)
- Employs industry-standard interfaces for integration
- Uses industry-standard software for the majority of services
- Incorporates additional middleware for interoperability among elements
- Uses software for coordinated programming of the elements in parallel

42 Cluster System
(Diagram: login & cluster access and the resource management & scheduling subsystem front a set of compute nodes; each node contains microprocessors with L1/L2/L3 caches, memory banks M1..Mn-1, controllers, storage (S), and NICs, all joined by the interconnect network.)

43 Clusters Dominate the Top 500 (chart)


45 UC Berkeley NOW Project
- NOW: SPARCstation 10s and 20s; originally ATM; first large Myrinet network
- NOW: UltraSPARC 170s; 128 MB memory and two 2 GB disks per node; Ethernet and Myrinet; largest Myrinet configuration in the world
- First cluster on the TOP500 list

46 Machine Parameters Affecting Performance
- Peak floating-point performance
- Main memory capacity
- Bi-section bandwidth
- I/O bandwidth
- Secondary storage capacity
- Organization: class of system; # nodes; # processors per node; accelerators; network topology
- Control strategy: MIMD; vector (PVP); SIMD; SPMD

47 Why Are Clusters so Prevalent?
- Excellent performance-to-cost for many workloads
- Exploit economy of scale: mass-produced device types; mainstream standalone subsystems; many competing vendors for similar products
- Just-in-place configuration: scalable up and down; flexible in configuration
- Rapid tracking of technology advances: first to exploit the newest component types
- Programmable: industry-standard programming languages and tools
- User empowerment

48 Key Parameters for Cluster Computing
- Peak floating-point performance
- Sustained floating-point performance
- Main memory capacity
- Bi-section bandwidth
- I/O bandwidth
- Secondary storage capacity
- Organization: processor architecture; # processors per node; # nodes; accelerators; network topology
- Logistical issues: power consumption; HVAC/cooling; floor space (sq. ft.)

49 Where's the Parallelism?
- Inter-node: multiple nodes; the primary level for commodity clusters; the secondary level for constellations
- Multi-socket, intra-node: routinely 1, 2, 4, or 8 sockets; heterogeneous computing with accelerators
- Multi-core, intra-socket: 2 or 4 cores per socket
- Multi-thread, intra-core: usually none or two
- ILP, intra-core: multiple operations issued per instruction; out-of-order execution with reservation stations; prefetching
- Accelerators

50 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

51 Fast and Gigabit Ethernet
- Cost effective (Lucent, 3Com, Cisco, etc.); directly leverages LAN technology and market
- Many ports per switch; switches can be stacked or connected with multiple gigabit links
- 100Base-T: bandwidth > 11 MB/s; latency < 90 microseconds
- 1000Base-T: bandwidth ~50 MB/s; latency < 90 microseconds
- 10 GbE: bandwidth 1250 MB/s; latency ~2.5 microseconds
(See the transfer-time sketch below.)
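Latency and bandwidth combine in the first-order message-time model t = latency + size/bandwidth; a sketch plugging in the figures quoted above (the 8 KB message size is an arbitrary choice):

```c
#include <stdio.h>

/* First-order message time: t = latency + size / bandwidth. */
static double msg_time(double latency_s, double bw_bytes_per_s, double bytes) {
    return latency_s + bytes / bw_bytes_per_s;
}

int main(void) {
    double kb8 = 8.0 * 1024.0;
    /* Figures from the slide: 1000Base-T ~50 MB/s at <90 us;
       10 GbE 1250 MB/s at ~2.5 us. */
    printf("8 KB on 1000Base-T: %.1f us\n",
           msg_time(90e-6, 50e6, kb8) * 1e6);
    printf("8 KB on 10 GbE:     %.1f us\n",
           msg_time(2.5e-6, 1250e6, kb8) * 1e6);
    return 0;
}
```

Short messages are dominated by the fixed latency term and long messages by the bandwidth term, which is why the slide quotes both numbers for each network.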

52 InfiniBand
- High performance: multi-Gbps links
- Low latency: ~1.2 microseconds
- Copper interconnects
- High availability: IEEE 802.3ad link aggregation / channel bonding

53 Dell PowerEdge SC1435 (Opteron, IBA)

54 Network Interconnect Topologies (diagrams: fat-tree (Clos) and torus)

55 Example: 320-host Clos topology of 16-port switches (diagram: five groups of 64 hosts each; from Myricom)

56 Arete InfiniBand Network (diagram)
