The Next Revolution in Computer Systems Architecture

1 The Next Revolution in Computer Systems Architecture
Richard Oehler, Corporate Fellow, Office of the CTO
University of Mannheim, 2/08/07

2 Computer Systems Architecture
Not just the processor chip: it's all the chips and interconnects (chipsets, memory, SMP fabric, ...).
It's the packaging: form factors, power/cooling.
It's the total system.

3 What's Been Happening
[Diagrams: a legacy front-side-bus platform (CPUs sharing an FSB, memory controller hub, XMBs, PCI-E bridges, I/O hub) next to an AMD64 Direct Connect platform (per-socket SRQ, crossbar, memory controller, and 8 GB/s HyperTransport links).]
Legacy x86 architecture: a 20-year-old traditional front-side bus (FSB) architecture in which CPUs and memory all share a bus. This is a major bottleneck to performance, so faster CPUs or more cores do not translate into performance.
AMD64's Direct Connect Architecture: industry-standard technology; Direct Connect reduces FSB bottlenecks; the HyperTransport interconnect offers scalable high bandwidth and low latency; four memory controllers increase memory capacity and bandwidth.

4 4P System Board Layout
[Board layout diagram for a 4P system.]

5 AMD's Building Blocks: Today, Tomorrow and the Future
[Diagram: four dual-core building blocks; each NorthBridge combines the SRI, crossbar (XBAR), memory controller (MCT) driving a 128-bit DRAM interface at 12.8 GB/s, and HyperTransport host bridges (HT-HB); coherent (cHT) and non-coherent (ncHT) HyperTransport links run at 4.0 GB/s per link at a 2 GT/s data rate.]

6 Lessons Learned #1
Allocation of XBAR command buffers across virtual channels can have a big impact on performance. MP traffic analysis gives the best allocation; e.g., an Opteron read transaction generates Request (2 visits), Probe (3 visits), and Response (8 visits) traffic.
[Diagram: the request, probe, and response packets of one read transaction flowing between P0-P3 and their memories in a 4-socket system.]
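
As an illustration of the kind of allocation the slide describes, the sketch below (hypothetical C, not AMD's actual allocator) splits a fixed pool of crossbar command-buffer entries across the three virtual channels in proportion to the per-transaction visit counts quoted above (Request 2, Probe 3, Response 8). The pool size is an assumption.

#include <stdio.h>

/* Hypothetical sketch: divide a fixed pool of crossbar command-buffer
 * entries across virtual channels in proportion to how often a typical
 * transaction visits each channel (visit counts from the slide). */
int main(void) {
    const char *vc[]   = { "Request", "Probe", "Response" };
    const int visits[] = { 2, 3, 8 };           /* visits per read transaction */
    const int pool     = 24;                    /* assumed total buffer entries */

    int total = 0;
    for (int i = 0; i < 3; i++) total += visits[i];

    int assigned = 0;
    int alloc[3];
    for (int i = 0; i < 3; i++) {
        alloc[i] = (pool * visits[i]) / total;  /* proportional share, rounded down */
        if (alloc[i] == 0) alloc[i] = 1;        /* every channel needs at least one entry */
        assigned += alloc[i];
    }
    alloc[2] += pool - assigned;                /* give any remainder to the busiest channel */

    for (int i = 0; i < 3; i++)
        printf("%-8s VC: %2d entries\n", vc[i], alloc[i]);
    return 0;
}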

7 Lessons Learned #2: Memory Latency is the Key to Application Performance!
[Chart: system and processor performance (OLTP1, OLTP2, SW99, SSL, JBB) versus average memory latency for a single 2.8 GHz core, 400 MHz DDR2 PC3200, 2 GT/s HT, 1 MB cache in an MP system, across 1N, 2N, 4N (SQ), 8N (TL), and 8N (L) configurations.]
Average hop distance and memory latency by topology:
1 Node: 0 hops, x + 0 ns
2 Node: 0.5 hops, x + 17 ns (47 cpuclk)
4 Node Square: 1.0 hops, x + 44 ns (124 cpuclk)
8 Node Twisted Ladder: 1.5 hops, x + 76 ns (214 cpuclk)
8 Node Ladder: 1.8 hops, x + 105 ns (234 cpuclk)
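
The "hops" figures above are just the mean hop distance from a node to every node (including itself) in the coherent HyperTransport topology. A small sketch in C shows the calculation for the 4-node square; the ~44 ns per-hop penalty used at the end is an assumption read off the table above, not a figure stated separately on the slide.

#include <stdio.h>

#define N 4
/* Adjacency matrix of the 4-node square topology (P0-P1, P0-P2, P1-P3, P2-P3). */
static const int adj[N][N] = {
    {0,1,1,0},
    {1,0,0,1},
    {1,0,0,1},
    {0,1,1,0},
};

/* Breadth-first search: hop distance from src to every node. */
static void hops_from(int src, int dist[N]) {
    int queue[N], head = 0, tail = 0;
    for (int i = 0; i < N; i++) dist[i] = -1;
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < N; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
    }
}

int main(void) {
    double per_hop_ns = 44.0;   /* assumed per-hop penalty, taken from the table above */
    int sum = 0, dist[N];
    for (int s = 0; s < N; s++) {
        hops_from(s, dist);
        for (int d = 0; d < N; d++) sum += dist[d];
    }
    double avg = (double)sum / (N * N);            /* 1.00 for the 4-node square */
    printf("average hops = %.2f\n", avg);
    printf("average memory latency ~= x + %.0f ns\n", avg * per_hop_ns);
    return 0;
}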

8 What's About To Happen

9 Barcelona: native quad-core upgrade for 2007
Native quad-core processor: increases performance-per-watt efficiency within the same Thermal Design Power.
Platform compatibility: socket and thermal compatible with Socket F.
Advanced process technology: 65nm Silicon-on-Insulator process; fast transistors with low power leakage to reduce power and heat.
Direct Connect Architecture: integrated memory controller designed for reduced memory latency and increased performance (memory directly connected); fast CPU-to-CPU communication (CPUs directly connected); glueless SMP up to 8 sockets.

10 The Barcelona Processor (4 cores/die)
Comprehensive upgrades for SSE128
Virtualization performance
Expandable shared L3 cache
Advanced power management
IPC-enhanced CPU cores
More delivered DRAM bandwidth

11 Trends in DRAM Bandwidth: Improved Efficiency is the Answer
Higher per-socket bandwidth demands; diverse streams increase conflicts; DRAM efficiency is declining. We must improve delivered DRAM bandwidth.

12 Delivering more DRAM bandwidth
Independent DRAM controllers
Optimized DRAM paging
Re-architect NB for higher BW
Write bursting
DRAM prefetcher
Core prefetchers

13 Balanced, Highly Efficient Cache Structure
Dedicated L1: locality keeps the most critical data in the L1 cache; lowest latency; 2 loads per cycle.
Dedicated L2: sized to accommodate the majority of today's working sets; dedicated to eliminate the conflicts common in shared caches; better for virtualization.
Shared L3 (new): victim-cache architecture maximizes efficiency of the cache hierarchy; fills from L3 leave likely shared lines in the L3; sharing-aware replacement policy; ready for expansion at the right time for customers.
[Diagram: four cores, each with its own cache control, 64KB L1 and 512KB L2, sharing a 2MB L3.]
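
To make the "victim cache" point concrete, here is a minimal sketch (hypothetical C, not Barcelona's actual controller logic) of the fill path the slide describes: the shared L3 is filled only by lines evicted from a core's L2, and a line that hits in the L3 is either moved to the requesting core or left in place when it is likely to be shared.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sketch of a victim L3: filled by L2 evictions, and on a hit
 * the line is removed unless the replacement policy thinks it is shared. */
typedef struct {
    unsigned long tag;
    bool valid;
    bool likely_shared;     /* e.g. code line, or previously touched by another core */
} l3_line_t;

#define L3_LINES 8
static l3_line_t l3[L3_LINES];

static unsigned idx(unsigned long addr) { return (unsigned)((addr >> 6) % L3_LINES); }

/* An L2 eviction (victim) is the only way data enters this L3. */
void l3_insert_victim(unsigned long addr, bool likely_shared) {
    l3_line_t *line = &l3[idx(addr)];
    line->tag = addr;
    line->valid = true;
    line->likely_shared = likely_shared;
}

/* On an L2 miss, probe the L3.  A fill to the core removes the line from the
 * L3 unless it is likely shared, in which case a copy is left behind. */
bool l3_lookup_and_fill(unsigned long addr, int core) {
    l3_line_t *line = &l3[idx(addr)];
    if (!line->valid || line->tag != addr)
        return false;                         /* miss: go to the memory controller */
    if (!line->likely_shared)
        line->valid = false;                  /* move the line into the core's L1/L2 */
    printf("core %d filled 0x%lx from L3%s\n",
           core, addr, line->valid ? " (copy left in L3)" : "");
    return true;
}

int main(void) {
    l3_insert_victim(0x1000, false);          /* private victim from core 0's L2 */
    l3_insert_victim(0x1040, true);           /* likely-shared victim (e.g. code) */
    l3_lookup_and_fill(0x1040, 1);            /* hit, copy stays in the L3 */
    l3_lookup_and_fill(0x1000, 1);            /* hit, line leaves the L3 */
    l3_lookup_and_fill(0x1000, 2);            /* now a miss */
    return 0;
}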

14 Quad-core System Power
[Diagram: a 2P Direct Connect system with two native quad-core processors linked by 8 GB/s HyperTransport, a PCI-E bridge (10W), and a hub (6W) for USB/PCI.]
2P system: 190 watts for processors (95W per CPU), 16 watts for chipset, 35.2 watts for DDR2 (17.6 watts per socket).
Direct Connect savings: no external memory controller saves 25 watts; no FBDIMM saves 48 watts.
System power is the metric that matters to our customers. Direct Connect helps reduce system power.
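
The arithmetic behind the 2P numbers is simple enough to write down; the sketch below just re-adds the figures quoted above (the individual values are from the slide, the totals are computed here).

#include <stdio.h>

int main(void) {
    /* 2P Direct Connect system, figures quoted on the slide */
    double cpus    = 2 * 95.0;      /* 190 W for two native quad-core processors */
    double chipset = 10.0 + 6.0;    /* 16 W: PCI-E bridge + hub */
    double ddr2    = 2 * 17.6;      /* 35.2 W for DDR2 behind the two integrated controllers */
    double total   = cpus + chipset + ddr2;

    /* Savings claimed for Direct Connect vs. an FSB/FBDIMM design */
    double saved_ext_mc = 25.0;     /* no external memory controller */
    double saved_fbdimm = 48.0;     /* no FBDIMM */

    printf("2P system power: %.1f W\n", total);
    printf("Direct Connect savings: %.1f W\n", saved_ext_mc + saved_fbdimm);
    return 0;
}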

15 Additional HyperTransport Ports Enable Fully Connected 4 Node (four x16 HT) and 8 Node (eight x8 HT)
Reduced network diameter: fewer hops to memory.
Increased coherent bandwidth: more links, and cHT packets visit fewer links.
HyperTransport3 benefits: low latency because of lower diameter; evenly balanced utilization of HyperTransport links; low queuing delays; low latency under load.
[Diagram: fully connected topologies using four x16 HT links or eight x8 HT links.]

16 4 Node Performance
[Diagrams: 4-node square and 4-node fully connected topologies (P0-P3).]
4N SQ (2 GT/s HyperTransport): Diam 2, Avg Diam 1.00, XFIRE BW 14.9 GB/s
4N FC (2 GT/s HyperTransport, 2 extra links): Diam 1, Avg Diam 0.75, XFIRE BW 29.9 GB/s (2X)
4N FC (4.4 GT/s HyperTransport3): Diam 1, Avg Diam 0.75, XFIRE BW 65.8 GB/s (4X)
XFIRE ("crossfire") BW is the link-limited all-to-all communication bandwidth (data only).
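
A rough model comes close to the XFIRE numbers above: take the aggregate unidirectional link bandwidth, divide by the average hop count (each hop of an all-to-all pattern consumes one link crossing), and scale by a payload-efficiency factor. The constants in the sketch are my assumptions (4 GB/s per x16 HT link direction at 2 GT/s, roughly 47% data payload after command/CRC overhead), not figures from the slide; they happen to land near the quoted 14.9, 29.9, and 65.8 GB/s.

#include <stdio.h>

/* Rough link-limited all-to-all ("XFIRE") bandwidth model:
 *   usable BW ~= (links * 2 directions * per-link GB/s) / avg hops * payload efficiency
 * Constants are assumptions chosen to approximate the slide's numbers. */
static double xfire_gbs(int links, double gbs_per_dir, double avg_hops, double payload_eff) {
    return links * 2 * gbs_per_dir / avg_hops * payload_eff;
}

int main(void) {
    const double eff = 0.47;   /* assumed data-payload fraction per link */

    printf("4N square, 2 GT/s HT  : %.1f GB/s\n", xfire_gbs(4, 4.0, 1.00, eff));
    printf("4N FC,     2 GT/s HT  : %.1f GB/s\n", xfire_gbs(6, 4.0, 0.75, eff));
    printf("4N FC,   4.4 GT/s HT3 : %.1f GB/s\n", xfire_gbs(6, 8.8, 0.75, eff));
    return 0;
}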

17 8 Node Performance
[Diagrams: 8-node twisted ladder, 8-node 6HT 2x4, and 8-node fully connected topologies (P0-P7).]
8N TL (2 GT/s HyperTransport): Diam 3, Avg Diam 1.62, XFIRE BW 15.2 GB/s
8N 2x4 (4.4 GT/s HyperTransport3): Diam 2, Avg Diam 1.12, XFIRE BW 72.2 GB/s (5X)
8N FC (4.4 GT/s HyperTransport3): Diam 1, Avg Diam 0.88, XFIRE BW 94.4 GB/s (6X)

18 Interesting Questions

19 What About Hardware Multi-threading?
Most implementations today don't achieve consistent results: some programs work well, some not so well but OK, and some work well only with it turned off. Results vary across the programs within an application, and, worse yet, vary within a program based on data input.
It doesn't justify the hardware costs, unless the hardware is targeted at a specific market that is known to benefit from multi-threading: increased complexity, chip area, schedule, and risk.
Given the uncertain nature of the benefit, it becomes difficult for software to manage: always on or off? Selectively on, based on program, or program data, or ...
But there is hope: careful analysis, upfront SMT design, and best-practices engineering can lead to real benefits.

20 Going Beyond Four Cores
Opteron's fundamental design continues to scale, and it is relatively easy to get beyond four cores (multi-chip packaging vs. single chip). The next generation of process technology (45nm) will get us most of the way there.
Will individual cores of the same design be faster in next-generation process technology? Expect somewhat faster, but not as much as in previous generations: circuit stability vs. power/thermal limits.
What about caches? There is a need for larger L2 and L3, driven not just by scaling effects but by memory bandwidth limitations. Issues in SRAM design: reliability (more transistors per cell, going from 6T to 8T or more) and the readiness of alternative technologies (ZRAM, E-DRAM).
And power? Even at slower speeds, processor scaling is still good.

21 Looking Beyond 8 Cores

22 Challenges Going Beyond 8 Cores
Sufficient memory bandwidth
Sufficient IO bandwidth
Single-thread performance
Extracting and managing multi-threading
Power/cooling
Heterogeneous cores; asymmetric homogeneous cores
Chip-level packaging
Cost/benefit
Yield & RAS: increase in soft errors as feature size continues to shrink, with more exposure in logic than memory

23 Sufficient Memory Bandwidth
AMD has optimized memory bandwidth with many advanced techniques for current Opterons, both to reduce usage (e.g. write combining) and to use empty cycles (e.g. prefetching); there is a law of diminishing returns.
A balanced design is required: typical instruction mixes are ~30% loads/stores, and increasing the number of cores increases total IPC by the IPC of the added cores times their scaling factor. Current designs are balanced for commercial and scientific workloads; scaling is reduced (by a lot) if cores are not balanced with memory bandwidth.
How to get more memory bandwidth? More memory channels and wider memory channels raise pin/wiring issues. Increasing memory speed means a reduced number of DIMMs, or significantly longer latency (FBDIMM); stacked DIMMs may be a partial solution. More bandwidth with much longer latency is not the right answer: OK for streaming data, not good for more random access.
There are many examples where significantly increasing cache improves cache hits and thereby reduces overall memory bandwidth. When is it better to use larger caches and fewer cores? Application specific or not?
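
To see why core count and memory bandwidth have to be balanced, a back-of-the-envelope sketch helps. The ~30% load/store mix is from the slide; the clock, IPC, miss rate, and line size below are assumed, illustrative parameters, not figures from the talk.

#include <stdio.h>

int main(void) {
    /* Back-of-the-envelope DRAM bandwidth demand per chip. */
    double freq_ghz    = 2.8;      /* assumed core clock */
    double ipc         = 1.0;      /* assumed sustained instructions per cycle per core */
    double ls_fraction = 0.30;     /* loads/stores per instruction (from the slide) */
    double miss_rate   = 0.02;     /* assumed fraction of loads/stores that reach DRAM */
    double line_bytes  = 64.0;     /* assumed bytes moved per miss */

    for (int cores = 1; cores <= 16; cores *= 2) {
        double accesses_per_s = cores * freq_ghz * 1e9 * ipc * ls_fraction * miss_rate;
        double gbs = accesses_per_s * line_bytes / 1e9;
        printf("%2d cores -> ~%5.1f GB/s of DRAM traffic\n", cores, gbs);
    }
    return 0;
}

With these assumptions a single core generates roughly 1 GB/s of DRAM traffic and sixteen cores roughly 17 GB/s, which is why delivered bandwidth per socket has to keep growing as cores are added.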

24 Sufficient Bandwidth
Opteron has best-of-breed bandwidth and connectivity, but a balanced design is still required. Rules of thumb:
1 B of I/O per instruction executed (commercial workloads, '70s/'80s)
... B of I/O per instruction executed (more graphic content, '90s)
... B of I/O per instruction executed (managed code, '00s)
Assuming 1 giga-instructions/sec, 8 cores will require between 1 GB/s and 1.8 GB/s of sustained transfers. Now consider the efficiencies of realizing that bandwidth: turnaround time, average packet size, overhead bytes vs. payload, and channel loading. With those, 8 cores need between 1.5 GB/s and 5 GB/s; today's best (PCI-Express) realizes 2.5 to 5 GB/s on 8/16 lanes.
Adding SMT and pushing the number of cores 2x or 4x either adds more I/O ports per chip or reduces inter-chip connectivity. Massive connectivity (think TPC-C-like environments) reduces the efficiency of I/O; more I/O ports are needed based on the nature of the workload.
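
The same rule-of-thumb arithmetic in code form. This is only a sketch: the per-core instruction rate, the bytes-per-instruction range, and the channel-efficiency factor are assumptions chosen so the totals land near the ranges quoted above.

#include <stdio.h>

int main(void) {
    /* Sustained I/O demand = cores * instruction rate * bytes of I/O per instruction,
     * then inflated by channel inefficiency (turnaround, packet overhead, loading). */
    int    cores             = 8;
    double gips_per_core     = 1.0;     /* assumed 1 giga-instructions/sec per core */
    double bytes_per_instr_lo = 0.125;  /* assumed range of I/O intensity */
    double bytes_per_instr_hi = 0.225;
    double channel_eff       = 0.6;     /* assumed fraction of raw link BW that is payload */

    double raw_lo = cores * gips_per_core * bytes_per_instr_lo;   /* GB/s */
    double raw_hi = cores * gips_per_core * bytes_per_instr_hi;
    printf("payload demand : %.1f - %.1f GB/s\n", raw_lo, raw_hi);
    printf("link capacity  : %.1f - %.1f GB/s needed after efficiency\n",
           raw_lo / channel_eff, raw_hi / channel_eff);
    return 0;
}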

25 Single Thread Performance
Single-thread performance still matters: most current metrics are based on it, and there are only limited multi-thread benchmarks. Many environments are not highly multi-threaded: digital media, the client space, dusty decks; the bulk of the market does not have high degrees of multi-threading.
How many design points are needed to cover the major markets, and in what order of introduction?
Needed: more really parallel environments, along with the tools to manage them.

26 Extracting and Managing Multithreading
Some applications do not multi-thread well: either the parallelism is not expressed well, or they have limited inherent parallelism. Many applications have good multi-thread potential but are not currently organized for multi-threading; there is a lack of (or poor) compilers, debuggers, OS support, and tools for multi-threading.
Writing correct and efficient multi-threaded applications is hard; some estimates indicate less than 2% of the programming population can do it well. There are very limited tools to find and extract algorithmic or non-coded parallelism automatically; it is one of the hardest problems in computer science and has been worked on for years with very little success.

27 Power/Cooling
Limits on power/cooling are how we got to multi-core to begin with; they caused a major rethink in how processors are designed. Processors contribute a significant percentage of overall box power, and multi-chip approaches make the problem worse. Performance (or price/performance) per watt, and per watt per unit of density, are the new metrics.
Pushing for a large number of cores per die while holding existing thermal envelopes results in slower cores. New designs save power in many ways: selectively power-reducing sections of cores, or whole cores, based on use, and limiting overall performance to a maximum power consumption. But the core multiplier is still significant when each core consumes 3-5 watts.
This is not just a core/die problem but a system problem: the amount of total memory needed to provide a balanced system. And not just a system problem but a customer problem: there are physical limitations on providing more power/cooling, limited by various utility issues. Power has become a serious TCO issue.

28 Types of Cores: Homogeneous, Asymmetric, Sequestered, Highly Reliable, ...
Today, large numbers of homogeneous cores/threads are already a real issue for OS management: thread dispatch queues are heavily contended, requiring new locking protocols and management; thread counts grow bigger than the OS design point (the number of threads is packed into some word or double-word structure); a major upheaval in the OS. "Build it and they will come" - maybe not, if overhead cancels the expected improvement.
Asymmetric homogeneous cores are an even more difficult problem: complicated hardware test and bring-up; discovery and reporting to the OS; managing dispatching based on core type is non-trivial (consider moving a thread from a slower to a faster core when the faster core becomes available). It is made even more complicated when balancing core power levels against overall application completion time.
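
As a toy illustration of the dispatch problem the slide raises, the sketch below (hypothetical C, not any real OS scheduler) migrates the highest-priority thread off a slow core when a fast core becomes idle; a real scheduler would also have to weigh migration cost, power state, and cache affinity.

#include <stdio.h>

#define NCORES 4

typedef struct { int fast; int thread; int prio; } core_t;   /* thread < 0 = idle */

/* When a fast core goes idle, pull the highest-priority thread off a slow core. */
static void on_core_idle(core_t cores[], int idle) {
    if (!cores[idle].fast) return;             /* only promote work onto faster cores */
    int best = -1;
    for (int i = 0; i < NCORES; i++)
        if (!cores[i].fast && cores[i].thread >= 0 &&
            (best < 0 || cores[i].prio > cores[best].prio))
            best = i;
    if (best < 0) return;                      /* nothing worth migrating */
    printf("migrating thread %d (prio %d) from slow core %d to fast core %d\n",
           cores[best].thread, cores[best].prio, best, idle);
    cores[idle] = (core_t){ 1, cores[best].thread, cores[best].prio };
    cores[best].thread = -1;
}

int main(void) {
    core_t cores[NCORES] = {
        { 1, -1, 0 },      /* fast core, just went idle */
        { 1,  7, 5 },      /* fast core, busy */
        { 0,  3, 9 },      /* slow core, high-priority thread */
        { 0,  4, 2 },      /* slow core, low-priority thread */
    };
    on_core_idle(cores, 0);
    return 0;
}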

29 Types of Cores - continued
Asymmetric cores make matters worse: now threads must be balanced against different core types, on top of all the previous issues.
Sequestered special-purpose cores are a type of asymmetric core: your favorite special application (or part thereof) needs to run on a special core, and the licensing issues are non-trivial. They are hidden from the OS and accessed through device drivers, libraries, or APIs. They are not just for software usage: hardware prefetching, hardware binary translation or optimization. They also open up possible designs for high or very high reliability.
Torrenza is an example of an asymmetric core, sequestered or not, evolving from a less efficient interconnect. Expect to see Torrenza cores as an early first instance of asymmetric cores.

30 Chip Level Packaging
There are many degrees of freedom: larger chips vs. a multi-chip module (MCM); mixed cores and caches vs. some cache on a separate chip; pin count vs. cost of package. An MCM can be used to put two (or more) multi-core chips together, often to reach the next level of multi-core without a full next-level design. An alternative structure separates some of the cache hierarchy from the cores: significantly larger, but slower-access, caches. An MCM works best if it uses an internal interconnect for the local chips - internal NB interfaces from MCM cores to MCM cores, or from MCM cores to an L3 (or beyond) cache - rather than the standard external coherency interface.
There are real cost breaks in increasing package/pin sizes: moving from plastic to ceramic is very non-linear. Multi-core designs almost always need more pins, and too many pins force more specialized pin spacing and an associated increase in manufacturing and assembly cost. Different packages suit different markets: acceptable cost vs. market size, with the volume in digital media and then the client space.

31 Cost/Benefit
As the degree of multi-core goes up, what is the level of scaling? It can be very good if a balanced system is maintained; if scaling falls off, the design gets limited to more specialized applications, and the economics of design and manufacturing change significantly as the market contracts.
What about modular designs - highly structured, mix and match? They increase high-level design time, and most of the design-side cost is in debug/verification, a function of how many different actual designs there are. Manufacturing costs increase: different parts and SKUs, demand-prediction risks, inventory-management risks, and different manufacturing-line optimizations (especially with different die sizes); low-level tuning and process adaptation become more complicated because there are more sizes.
Where are the crossover points? Market economics, available capital and resources.

32 Yield and RAS
As technology shrinks, the density of circuits for a given die size increases significantly, and defects are more likely to make a die less than perfect. We will need a partial-good strategy and may need to use sparing at the core level (already done for some caches).
Soft errors increase as feature size continues to shrink, and they are more difficult to handle in logic than in memory. Error detection, correction, and sparing are common practice in caches; until now soft logic errors have not been a major issue, but they are harder than memory errors to detect and correct and need new methodology and design tools. When is it better to design redundant cores, or even TMR cores, than to continue adding more and more detection and correction logic to individual cores? Yield curves and design complexity will determine the crossover. Redundant cores in specialized systems can be useful in markets that require very highly available systems.
Should multi-core chips be internally partitionable, especially in 24/7 environments, for service and diagnostics? Where is the crossover between scale-out and scale-up? It varies with the application mix.
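
The partial-good argument can be quantified with a standard yield model. This is a textbook Poisson/binomial sketch with made-up parameters, not AMD data: the chance that all cores on a large die are defect-free falls quickly, while the chance that, say, at least 6 of 8 are good stays high, which is what makes core-level sparing attractive.

#include <math.h>
#include <stdio.h>

/* n-choose-k for small n */
static double choose(int n, int k) {
    double r = 1.0;
    for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
    return r;
}

int main(void) {
    int    cores           = 8;
    double core_area_cm2   = 0.20;   /* assumed area per core plus its cache */
    double defects_per_cm2 = 0.5;    /* assumed random defect density */

    /* Poisson model: probability a single core is defect-free. */
    double p_good = exp(-defects_per_cm2 * core_area_cm2);

    /* Binomial: probability at least k of the cores are good. */
    for (int k = cores; k >= 6; k--) {
        double p_at_least = 0.0;
        for (int g = k; g <= cores; g++)
            p_at_least += choose(cores, g) * pow(p_good, g) * pow(1 - p_good, cores - g);
        printf("P(at least %d of %d cores good) = %.3f\n", k, cores, p_at_least);
    }
    return 0;
}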

33 How Big is Big Enough?
Why go bigger than 8? Consider 8 cores per die and 8 dies per system: 64 threads; add in SMT and the number is at least doubled. Position such a system against today's or the near term's largest SMP: such multi-core/multi-chip systems are more powerful (by any measure) than what is currently in the market.
Are there individual applications that need this much compute power? The biggest TPC-C benchmarks could be run easily on such an 8x8 system. Do real applications today, or in the near future, need this size? Very few, if any, and as more cores are added, the number of applications requiring such power diminishes non-linearly.
What about other environments? Server consolidation using virtualization can scale this large, or larger, if the market demands it, but throughput vs. reliability will need to be measured carefully. Hosted clients can be another such environment - an emerging market.
What are the tradeoffs between increasing multi-core vs. more chips per system? Economics. Going beyond 8 cores per die will require very reliable dies and systems: MTBF terms are multiplied, not subtracted. How does all of this relate to scale-up? Good question; the market is leaning toward scale-up, but with sufficiently large SMPs.
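
The reliability point ("MTBF terms are multiplied, not subtracted") is easy to see with the usual series-reliability arithmetic. This is a generic sketch with made-up MTBF figures: for independent parts the reliabilities multiply (equivalently, the failure rates add), so system MTBF shrinks roughly as 1/N as dies are added.

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Series system of identical dies: reliabilities multiply, failure rates add.
     * The per-die MTBF is an assumed figure for illustration only. */
    double die_mtbf_hours = 1.0e6;
    double mission_hours  = 3.0 * 365 * 24;       /* three years of 24/7 operation */

    for (int dies = 1; dies <= 8; dies *= 2) {
        double sys_mtbf = die_mtbf_hours / dies;                    /* 1 / sum of failure rates */
        double r_die    = exp(-mission_hours / die_mtbf_hours);     /* per-die survival prob. */
        double r_sys    = pow(r_die, dies);                         /* product of reliabilities */
        printf("%d dies: system MTBF %.0f h, P(no failure in 3 years) = %.3f\n",
               dies, sys_mtbf, r_sys);
    }
    return 0;
}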

34 AMD Continues to Lead with Customer-Centric x86 Innovations: The Next Big Things

35 Continuum of Solutions, from "Torrenza" to "Fusion"
HTX accelerator
PCIe add-in accelerator
Chipset accelerator
Socket-compatible accelerator in an Opteron socket ("Stream" general-purpose GPU)
Package-level integration (MCM): accelerator alongside the CPU
Silicon-level integration: CPU, NB, and accelerator on one die (accelerated processors)

36 Redefining Computer Systems Architecture: Torrenza & Fusion

37 First AMD Fusion Product: Accelerated Processor Combining CPU and GPU
Fusion vision:
Create the optimal computing experience for an increasingly mobile, graphics- and media-centric world.
Deliver improvements in microprocessor performance-per-watt-per-dollar over today's CPU-only architectures.
Continue to scale x86 by enabling new x86 computing paradigms, classes and form factors.

38 The Next Big Things
AMD, the company that first brought the x86 industry simultaneous 32-bit/64-bit computing, the integrated memory controller, HyperTransport Technology, native multi-core, and the truly open x86 platform, is once again changing the world of computing: Torrenza and Fusion.

39 Backup

40 Abstract
Computer systems architecture is going through a major rethinking. Constraints from form factors, to scaling, to power/cooling, to system balance have overwhelmed current designs. This talk will discuss these reasons and a few others that have caused this to happen, what some of the new design ideas/parameters are, how they will manifest themselves in systems in the not-too-distant future, and what more needs to be done before there is a new stable base.

41 Talk Outline
Computer Systems Architecture
Today's World: form factors, scaling, power/cooling, system balance, ...
Not So New Ideas: multi-core, accelerators, heterogeneous
Issues to be Solved
Future Directions

42 Delivering more DRAM bandwidth - Independent DRAM controllers
Concurrency: more DRAM banks reduce page conflicts; longer burst length improves command efficiency.

43 Delivering more DRAM bandwidth - Optimized DRAM paging
Increase page hits, decrease page conflicts; history-based pattern predictor.
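
A minimal sketch of what a history-based page predictor can look like (hypothetical C, not the actual Barcelona northbridge logic): a small saturating counter per bank records whether leaving the row open has been paying off, and the controller uses it to choose between open-page and close-page behaviour on each access.

#include <stdbool.h>
#include <stdio.h>

#define BANKS 8

static int history[BANKS];      /* per-bank saturating counter 0..3; >=2 means "keep row open" */
static int open_row[BANKS];     /* currently open row, -1 = bank precharged */

/* Called on every access to a bank; returns true if the row is left open afterwards. */
static bool access_bank(int bank, int row) {
    bool hit = (open_row[bank] == row);

    /* Train the predictor: a row hit rewards open-page, a row conflict punishes it. */
    if (open_row[bank] != -1) {
        if (hit  && history[bank] < 3) history[bank]++;
        if (!hit && history[bank] > 0) history[bank]--;
    }

    open_row[bank] = row;
    bool keep_open = history[bank] >= 2;       /* predictor decision */
    if (!keep_open) open_row[bank] = -1;       /* close-page: precharge right away */

    printf("bank %d row %3d: %-8s -> %s\n", bank, row,
           hit ? "page hit" : "miss", keep_open ? "leave row open" : "precharge");
    return keep_open;
}

int main(void) {
    for (int b = 0; b < BANKS; b++) { open_row[b] = -1; history[b] = 2; }
    /* A streaming pattern keeps hitting the same row, so the predictor stays in
     * open-page mode; scattered rows drive it back toward close-page. */
    int pattern[] = { 10, 10, 10, 10, 42, 7, 99, 3 };
    for (int i = 0; i < 8; i++) access_bank(0, pattern[i]);
    return 0;
}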

44 Delivering more DRAM bandwidth - Re-architect NB for higher BW
Increase buffer sizes; optimize schedulers; ready to support future DRAM technologies.

45 Delivering more DRAM bandwidth - Write bursting
Minimize read/write turnaround.

46 Delivering more DRAM bandwidth - DRAM prefetcher
Track positive and negative, unit and non-unit strides; dedicated buffer for prefetched data; aggressively fill idle DRAM cycles.
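
A sketch of the kind of stride tracking the slide alludes to (hypothetical C, not the real prefetcher): remember the last address and last delta for a stream; once the same positive or negative, unit or non-unit stride repeats, issue a prefetch for the next address in the pattern into a dedicated buffer.

#include <stdio.h>

/* Minimal stride detector for one access stream: if the delta between
 * consecutive addresses repeats, prefetch the next address in the pattern.
 * Works for positive or negative, unit or non-unit strides. */
typedef struct {
    long last_addr;
    long last_stride;
    int  confidence;           /* how many times the stride has repeated */
} stream_t;

static void observe(stream_t *s, long addr) {
    long stride = addr - s->last_addr;
    if (stride != 0 && stride == s->last_stride) {
        if (s->confidence < 3) s->confidence++;
    } else {
        s->confidence = 0;
    }
    s->last_stride = stride;
    s->last_addr   = addr;

    if (s->confidence >= 2)    /* stride seen repeatedly: prefetch ahead */
        printf("access 0x%lx -> prefetch 0x%lx into prefetch buffer\n",
               addr, addr + stride);
    else
        printf("access 0x%lx -> no prefetch (training)\n", addr);
}

int main(void) {
    stream_t s = { 0, 0, 0 };
    long descending[] = { 0x4000, 0x3F80, 0x3F00, 0x3E80, 0x3E00 };  /* -128-byte stride */
    for (int i = 0; i < 5; i++) observe(&s, descending[i]);
    return 0;
}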

47 Delivering more DRAM bandwidth - Core prefetchers
DC prefetcher fills directly to the L1 cache; IC prefetcher is more flexible, with 2 outstanding requests to any address.
