The Next Revolution in Computer Systems Architecture
1 The Next Revolution in Computer Systems Architecture Richard Oehler Corporate Fellow Office of the CTO University of Mannheim 2/08/07
2 Computer Systems Architecture: Not just the processor chip. It's all the chips and interconnects: chipsets, memory, SMP fabric. It's the packaging: form factors, power/cooling. It's the total system.
3 What's Been Happening
[Diagram: legacy x86 front-side-bus platform (CPUs, memory controller hub, XMBs, PCI-E bridges, I/O hubs, USB/PCI all sharing one bus) contrasted with four Opterons, each with its own SRQ, crossbar, and memory controller, linked by 8 GB/s HyperTransport.]
Legacy x86 architecture: a 20-year-old traditional front-side bus (FSB) architecture in which CPUs and memory all share a bus, a major bottleneck to performance; faster CPUs or more cores do not translate into performance.
AMD64's Direct Connect Architecture: industry-standard technology. Direct Connect Architecture reduces FSB bottlenecks; the HyperTransport interconnect offers scalable high bandwidth and low latency; 4 memory controllers increase memory capacity and bandwidth.
4 4P System Board Layout 4
5 AMD's Building Blocks: Today, Tomorrow and the Future
[Diagram: four dual-core building blocks forming a 4P system. Each die pairs Core 0/Core 1 with a System Request Interface (SRI), crossbar (XBAR), memory controller (MCT), a 12.8 GB/s 128-bit DRAM interface, and HyperTransport ports, coherent (cHT) and non-coherent (ncHT) links at 4.0 GB/s per direction at a 2 GT/s data rate, with HT host bridges (HT-HB) in the northbridge.]
6 Lessons Learned #1
Allocation of XBAR command buffers across virtual channels can have a big impact on performance. MP traffic analysis gives the best allocation, e.g. an Opteron read transaction: request (2 link visits), probe (3 visits), response (8 visits).
[Diagram: four processors P0-P3, each with L2 cache and local memory, showing the numbered sequence of read (RD), probe (PI/RP), and data/response (D/SD) packets flowing between nodes for one transaction.]
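The proportional-allocation idea behind this lesson can be sketched in a few lines. This is my own illustration, not AMD's actual algorithm: the function name and the largest-remainder rounding are assumptions; only the 2/3/8 request/probe/response visit counts come from the slide.

```python
# Illustrative sketch: split a fixed XBAR command-buffer pool across
# virtual channels in proportion to how often each packet class visits
# a link (the 2/3/8 request/probe/response split from the slide).

def allocate_buffers(total_buffers, visits_per_vc):
    """Give each virtual channel a share proportional to its link visits."""
    total_visits = sum(visits_per_vc.values())
    # Ideal fractional shares of the pool.
    shares = {vc: total_buffers * v / total_visits
              for vc, v in visits_per_vc.items()}
    # Round down, then hand leftover buffers to the largest remainders
    # so the whole pool is used.
    alloc = {vc: int(s) for vc, s in shares.items()}
    leftover = total_buffers - sum(alloc.values())
    for vc in sorted(shares, key=lambda vc: shares[vc] - alloc[vc],
                     reverse=True)[:leftover]:
        alloc[vc] += 1
    return alloc

visits = {"request": 2, "probe": 3, "response": 8}
print(allocate_buffers(24, visits))  # {'request': 4, 'probe': 5, 'response': 15}
```

With a 24-entry pool, the response channel (8 of every 13 visits) gets the lion's share, which is the qualitative point of the slide.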
7 Lessons Learned #2
Memory latency is the key to application performance! Performance vs. average memory latency (single 2.8 GHz core, 400 MHz DDR2 PC3200, 2 GT/s HyperTransport, 1 MB cache, in an MP system), measured on OLTP1, OLTP2, SW99, SSL, and JBB across topologies:

Topology | AvgD | Latency
1 Node | 0 hops | x + 0 ns
2 Node | 0.5 hops | x + 17 ns (47 cpuclk)
4 Node square | 1 hop | x + 44 ns (124 cpuclk)
8 Node twisted ladder | 1.5 hops | x + 76 ns (214 cpuclk)
8 Node ladder | 1.8 hops | x + 105 ns (234 cpuclk)

[Charts: processor performance (0-120%) falling as average latency grows from 1N to 8N; diagrams of the 1-node, 2-node, 4-node square, 8-node twisted ladder, and 8-node ladder topologies P0-P7.]
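The slide's AvgD ("average distance") figures can be reproduced with a small breadth-first-search sketch of my own, not from the talk; note the slide's convention appears to average over all node pairs including a node to itself, since 1N comes out at 0 hops.

```python
# Illustrative sketch: average hop count of a small HyperTransport
# topology, computed by breadth-first search over the link graph.
from collections import deque

def avg_hops(links, n):
    """Average shortest-path distance over all ordered node pairs,
    including node-to-self (distance 0), matching the slide's AvgD."""
    adj = {i: [] for i in range(n)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    total = 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
    return total / (n * n)

# 4-node square: each node linked to two neighbours.
square = [(0, 1), (1, 3), (3, 2), (2, 0)]
print(avg_hops(square, 4))  # 1.0, matching the slide's 4N (SQ) row
```

Adding the two diagonals (a fully connected 4-node) drops the average to 0.75, which matches the figure quoted on the later "4 Node Performance" slide.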
8 What's About To Happen
9 Barcelona: native quad-core upgrade for 2007. Native quad-core processor to increase performance-per-watt efficiencies within the same thermal design power. Platform compatibility: socket- and thermal-compatible with Socket F. Advanced process technology: 65nm silicon-on-insulator process; fast transistors with low power leakage to reduce power and heat. Direct Connect Architecture: integrated memory controller designed for reduced memory latency and increased performance; memory directly connected; fast CPU-to-CPU communication with CPUs directly connected; glueless SMP up to 8 sockets.
10 The Barcelona Processor (4 cores/die): comprehensive upgrades for virtualization performance, SSE128, an expandable shared L3 cache, advanced power management, IPC-enhanced CPU cores, and more delivered DRAM bandwidth.
11 Trends in DRAM bandwidth Improved Efficiency is the Answer Higher per-socket bandwidth demands Diverse streams increase conflicts DRAM efficiency declining We must improve delivered DRAM bandwidth 11
12 Delivering more DRAM bandwidth Independent DRAM controllers Optimized DRAM paging Re-architect NB for higher BW Write bursting DRAM prefetcher Core prefetchers 12
13 Balanced, Highly Efficient Cache Structure
Dedicated L1 (64KB per core): locality keeps the most critical data in the L1 cache; lowest latency; 2 loads per cycle.
Dedicated L2 (512KB per core): sized to accommodate the majority of working sets today; dedicated to eliminate the conflicts common in shared caches; better for virtualization.
Shared L3 (2MB, NEW): victim-cache architecture maximizes efficiency of the cache hierarchy; fills from L3 leave likely-shared lines in the L3; sharing-aware replacement policy; ready for expansion at the right time for customers.
[Diagram: four cores, each with its own cache control, L1, and L2, sharing one L3.]
14 Quad-core System Power
[Diagram: 2P Direct Connect system with two native quad-core processors linked by 8 GB/s HyperTransport, a 10W PCI-E bridge, a 6W hub, and USB/PCI.]
2P system power: 190 watts for the processors (95W per CPU), 16 watts for the chipset, 35.2 watts for DDR2 (17.6 watts per socket).
Direct Connect savings: no external memory controller saves 25 watts; no FBDIMM saves 48 watts.
System power is the metric that matters to our customers. Direct Connect helps reduce system power.
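The slide's platform total is a simple sum of the per-component figures; a quick check (the per-component numbers are from the slide, the totals are mine):

```python
# Back-of-envelope check of the 2P system power figures on the slide.
cpus = 2 * 95.0        # two 95 W quad-core processors -> 190 W
chipset = 16.0         # 10 W PCI-E bridge + 6 W hub
ddr2 = 2 * 17.6        # 17.6 W of DDR2 per socket -> 35.2 W
platform = cpus + chipset + ddr2
print(platform)        # 241.2 W for processors + chipset + memory

# Direct Connect savings claimed on the slide:
saved = 25 + 48        # no external memory controller + no FBDIMM
print(saved)           # 73 W avoided relative to an FSB/FBDIMM design
```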
15 Additional HyperTransport ports enable fully connected 4-node (four x16 HT) and 8-node (eight x8 HT) systems. Reduced network diameter: fewer hops to memory. Increased coherent bandwidth: more links, and cHT packets visit fewer links. HyperTransport3 benefits: low latency because of the lower diameter, evenly balanced utilization of HyperTransport links, low queuing delays, and low latency under load. Four x16 HT or eight x8 HT.
16 4 Node Performance
[Diagrams: 4-node square P0-P3 vs. 4-node fully connected with 2 extra links.]
4N SQ (2 GT/s HyperTransport): diameter 2, average diameter 1.00, XFIRE BW 14.9 GB/s.
4N FC, with 2 extra links (2 GT/s HyperTransport): diameter 1, average diameter 0.75, XFIRE BW 29.9 GB/s (2X).
4N FC with HyperTransport3 (4.4 GT/s): diameter 1, average diameter 0.75, XFIRE BW 65.8 GB/s (4X).
XFIRE ("crossfire") BW is the link-limited all-to-all communication bandwidth (data only).
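The relative XFIRE gains can be estimated from link count, per-link rate, and average hop distance. This is my own back-of-envelope model, not AMD's method: absolute numbers ignore packet and command overheads, so only the ratios are meaningful, but those come out close to the slide's 2X and 4X.

```python
# Rough model of relative all-to-all ("XFIRE") bandwidth: aggregate raw
# link capacity divided by the average hops each byte must travel.

def xfire_relative(n_links, gt_per_s, avg_hops_distinct):
    # 16-bit HT link: ~2 bytes per transfer per direction, 2 directions.
    aggregate = n_links * 2 * 2 * gt_per_s   # GB/s of raw link capacity
    return aggregate / avg_hops_distinct     # uniform-routing upper bound

sq  = xfire_relative(4, 2.0, 4 / 3)  # 4N square: avg 4/3 hops between distinct nodes
fc  = xfire_relative(6, 2.0, 1.0)    # 4N fully connected, 2 GT/s
fc3 = xfire_relative(6, 4.4, 1.0)    # 4N fully connected, 4.4 GT/s HT3
print(fc / sq, fc3 / sq)             # 2.0 and 4.4, in line with the slide's 2X / 4X
```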
17 8 Node Performance
[Diagrams: 8-node twisted ladder, 8-node 2x4 with 6 HT ports per node, and 8-node fully connected, P0-P7.]
8N TL (2 GT/s HyperTransport): diameter 3, average diameter 1.62, XFIRE BW 15.2 GB/s.
8N 2x4 (4.4 GT/s HyperTransport3): diameter 2, average diameter 1.12, XFIRE BW 72.2 GB/s (5X).
8N FC (4.4 GT/s HyperTransport3): diameter 1, average diameter 0.88, XFIRE BW 94.4 GB/s (6X).
18 Interesting Questions
19 What About Hardware Multi-threading? Most implementations today don't achieve consistent results: some programs work well, some not so well but OK, and some work well only with it turned off. Results vary across the programs within an application, and worse yet, vary within a program based on data input. It doesn't justify the hardware costs (increased complexity, chip area, schedule, risk) unless the hardware is targeted at a specific market that is known to benefit from multi-threading. Given the uncertain nature of the benefit, it becomes difficult for software to manage: always on or off? Selectively on, based on program, or program data, or...? But there is hope: careful analysis, upfront SMT design, and best-practices engineering can lead to real benefits.
20 Going Beyond Four Cores. Opteron's fundamental design continues to scale; it is relatively easy to get beyond four cores. Multi-chip packaging vs. single chip: the next generation of process technology (45nm) will get us most of the way there. Will individual cores of the same design be faster in next-generation process technology? Expect somewhat, but not as much as in previous generations: circuit stability vs. power/thermal limits. What about caches? There is a need for larger L2 and L3, driven not just by scaling effects but by memory bandwidth limitations. Issues in SRAM design: reliability (more transistors per cell, going from 6T to 8T or more) and alternative technology readiness (ZRAM, E-DRAM). And power? Even at slower speeds, processor scaling is still good.
21 Looking Beyond 8 Cores
22 Challenges Going Beyond 8 Cores Sufficient Memory Bandwidth Sufficient IO Bandwidth Single thread performance Extracting and managing multi-threading Power/Cooling Heterogeneous Cores Asymmetric Homogeneous Cores Chip Level Packaging Cost/Benefit Yield & RAS Increase in Soft Errors as feature size continues to shrink More exposure in Logic than Memory 22
23 Sufficient Memory Bandwidth. AMD has optimized memory bandwidth with many advanced techniques for current Opterons, both to reduce usage (e.g. write combining) and to use empty cycles (e.g. prefetching); there is a law of diminishing returns. A balanced design is required: typical instruction mixes contain ~30% loads/stores, and increasing the number of cores increases total IPC by the IPC of the added cores times the scaling factor. Current designs are balanced for commercial and scientific workloads; scaling is reduced (by a lot) if cores are not balanced with memory bandwidth. How to get more memory bandwidth: more memory channels or wider memory channels, but there are pin/wiring issues for more/wider channels. Increasing memory speed means a reduced number of DIMMs, or significantly longer latency (FBDIMM); stacked DIMMs may be a partial solution. More bandwidth with much longer latency is not the right answer: OK for streaming data, not good for more random access. There are many examples where significantly increasing cache improves cache hits, thereby reducing overall memory bandwidth. When is it better to use larger caches and fewer cores? Application specific or not?
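The balance argument can be made concrete with a rough demand model. All parameter values below are illustrative assumptions of mine, not measured Opteron figures, except the ~30% load/store mix quoted on the slide.

```python
# Hypothetical balance check: does delivered DRAM bandwidth keep up as
# cores are added? Demand = instructions/s * memory-ref fraction * miss
# rate * cache-line size.

def required_bw(cores, ipc, ghz, ls_frac=0.30, miss_rate=0.02, line=64):
    """Rough DRAM demand in GB/s (miss rate and line size are assumptions)."""
    instr_per_s = cores * ipc * ghz * 1e9
    return instr_per_s * ls_frac * miss_rate * line / 1e9

for cores in (1, 2, 4, 8):
    print(cores, "cores ->", round(required_bw(cores, ipc=1.0, ghz=2.8), 1), "GB/s")
```

Under these assumptions, 8 cores demand roughly 8.6 GB/s, uncomfortably close to the 12.8 GB/s DRAM interface quoted earlier, which is the slide's point: scaling falls off unless cores and memory bandwidth are balanced.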
24 Sufficient Bandwidth. Opteron has best-of-breed bandwidth and connectivity, but a balanced design is still required. Rules of thumb: 1 b of I/O per instruction executed (commercial workloads, '70s-'80s); more bits per instruction with richer graphic content ('90s) and managed code ('00s). Assuming 1 giga-instruction/sec per core, 8 cores will require between 1 GB/sec and 1.8 GB/sec of sustained transfers. Now consider the efficiencies of realizing that bandwidth: turnaround time, average packet size, overhead bytes vs. payload, and channel loading mean 8 cores need between 1.5 GB/sec and 5 GB/sec. Today's best, PCI Express, realizes 2.5 to 5 GB/sec on 8/16 lanes. Adding SMT and pushing the number of cores x2 or x4 either adds more I/O ports per chip or reduces inter-chip connectivity. Massive connectivity (think TPC-C-like environments) reduces the efficiency of I/O; more I/O ports are needed based on the nature of the workload.
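The rule-of-thumb arithmetic is worth making explicit. The 1.8 b/instruction upper figure is inferred from the slide's 1-1.8 GB/s range for 8 cores at 1 GIPS each; treat it as an assumption.

```python
# Bits of I/O per instruction times sustained instruction rate gives
# required I/O bandwidth (the slide's rule of thumb, made explicit).

def io_bw_gb_s(cores, gips_per_core, bits_per_instr):
    bits_per_s = cores * gips_per_core * 1e9 * bits_per_instr
    return bits_per_s / 8 / 1e9   # bits -> bytes -> GB/s

print(io_bw_gb_s(8, 1.0, 1.0))   # 1.0 GB/s (1 b/instr, commercial rule)
print(io_bw_gb_s(8, 1.0, 1.8))   # 1.8 GB/s (assumed upper figure)
```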
25 Single Thread Performance. Single-thread performance still matters: most current metrics are based on it, and multi-thread benchmarks are limited. Many environments are not highly multi-threaded: digital media, the client space, dusty decks; the bulk of the market does not have high degrees of multi-threading. How many design points are needed to cover the major markets, and in what order of introduction? Needed: more really parallel environments, along with tools to manage them.
26 Extracting and Managing Multi-threading. Some applications do not multi-thread well: the parallelism is either not expressed well or inherently limited. Many applications have good multi-thread potential but are not currently organized for multi-threading, and there is a lack of (or poor) compilers, debuggers, OS support, and tools for multi-threading. Writing correct and efficient multi-threaded applications is hard: some estimates indicate less than 2% of the programming population can do it well. There are very limited tools to automatically find and extract algorithmic or non-coded parallelism; it is one of the hardest problems in computer science and has been worked on for many years with very little success.
27 Power/Cooling. Limits on power/cooling are how we got to multi-core to begin with; they caused a major rethink in how processors are designed, and processors contribute a significant percentage of overall box power. Multi-chip packaging makes the problem worse. Performance (or price/performance) per watt, and per watt per unit of density, are the new metrics. Pushing for a large number of cores per die while holding existing thermal envelopes results in slower cores. New designs save power in many ways: selectively power-reducing sections of cores, or whole cores, based on use; limiting overall performance based on a maximum power consumption. But the core multiplier is still significant when each core consumes 3-5 watts. This is not just a core/die problem but a system problem: the amount of total memory needed to provide a balanced system. And not just a system problem but a customer problem: physical limitations on providing more power/cooling, limited by various utility issues. Power has become a serious TCO issue.
28 Types of Cores: Homogeneous, Asymmetric, Sequestered, Highly Reliable. Today, large numbers of homogeneous cores/threads are a real issue for OS management: thread dispatch queues are heavily contended, requiring new locking protocols and management; thread counts exceed the OS design point (the number of threads is packed into some word or double-word structure); a major upheaval in the OS. "Build it and they will come"? Maybe not, if overhead cancels the expected improvement. Asymmetric homogeneous cores are an even more difficult problem: complicated hardware test and bring-up; discovery and reporting to the OS; managing dispatching based on core type is non-trivial (consider moving a thread from a slower to a faster core when the faster core becomes available), made even more complicated when balancing core power levels against overall application completion time.
29 Types of Cores - continued. Asymmetric cores make matters worse: threads must now be balanced against different core types, on top of all the previous issues. Sequestered special-purpose cores are a type of asymmetric core, for a favorite special application (or part thereof) that needs to run on special hardware; licensing issues are non-trivial. They are hidden from the OS and accessed through device drivers, libraries, or APIs; they are not just for software usage (hardware prefetching, hardware binary translation or optimization), and they open up possible designs for high or very high reliability. Torrenza is an example of an asymmetric core, sequestered or not, evolving from a less efficient interconnect; expect to see Torrenza cores as an early first instance of asymmetric cores.
30 Chip-Level Packaging. There are many degrees of freedom: larger chips vs. a multi-chip module (MCM); mixed cores and caches vs. some cache on a separate chip; pin count vs. cost of the package. Packaging can be used to put two (or more) multi-core chips together, often to reach the next level of multi-core without a full next-level design. An alternative structure separates some of the cache hierarchy from the cores: significantly larger, but slower-access-time, caches. An MCM works best if it uses an internal interconnect for its local chips (internal NB interfaces, MCM cores to MCM cores, MCM cores to L3-or-beyond cache), not the standard external coherency interface. There are real cost breaks in increasing package/pin sizes: moving from plastic to ceramic is very non-linear; multi-core designs almost always need more pins, and too many pins force more specialized pin spacing with an associated increase in manufacturing and assembly cost. Different packages for different markets: acceptable cost vs. market size; the volume is in digital media, then the client space.
31 Cost/Benefit. As the degree of multi-core goes up, what is the level of scaling? It can be very good if a balanced system is maintained; if scaling falls off, the design gets limited to more specialized applications, and the economics of design and manufacturing change significantly as the market contracts. What about modular designs? Highly structured, mix-and-match, but with increased high-level design time; most design-side cost is in debug/verification, a function of how many different actual designs there are. Manufacturing costs increase: different parts and SKUs, demand-prediction risks, inventory-management risks, and different manufacturing-line optimizations (especially with different die sizes); low-level tuning and process adaptation are more complicated because there are more sizes. What are the crossover points? Market economics, available capital, and resources.
32 Yield and RAS. As technology shrinks, the density of circuits for a given die size increases significantly, and defects are more likely to make a die less than perfect. We will need a partial-good strategy and may need to use sparing at the core level, as is already done for some caches. Soft errors increase as feature size continues to shrink, and they are more difficult to handle in logic than in memory: error detection, correction, and sparing are common practice in caches, but until now soft logic errors have not been a major issue; they are harder than memory errors to detect and correct, and need new methodology and design tools. When is it better to design redundant cores, or even TMR cores, than to continue adding more and more detection and correction logic to individual cores? Yield curves and design complexity will determine the crossover. Redundant cores in specialized systems can be useful in markets that require very highly available systems. Should multi-core chips be internally partitionable, especially in 24/7 environments, for service and diagnostics? Where is the crossover between scale-out and scale-up? It varies with the application mix.
33 How Big is Big Enough? Why go bigger than 8? Consider 8 cores per die and 8 dies per system: 64 threads, at least doubled with SMT. Position such a system against today's or the near term's largest SMP: such multi-core/multi-chip systems are more powerful (by any measure) than what is currently in the market. Are there individual applications that need this much compute power? The biggest TPC-C benchmarks could be run easily on such an 8x8 system. Do real applications today, or in the near future, need this size? Very few, if any; as more cores are added, the number of applications requiring such power diminishes non-linearly. What about other environments? Server consolidation using virtualization can scale this large, or larger, if the market demands it, but throughput vs. reliability will need careful measurement. Hosted clients can be another such environment, an emerging market. What are the tradeoffs between increasing multi-core vs. more chips per system? Economics: going beyond 8 cores per die will require very reliable dies and systems, since MTBF terms are multiplied, not subtracted. How does all of this relate to scale-up? Good question; the market is leaning toward scale-up, but with sufficiently large SMPs.
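The MTBF remark can be sketched numerically: with independent failures, failure rates add, so system MTBF is the reciprocal of the summed rates and shrinks rapidly with die count. The 100,000-hour per-die MTBF below is an illustrative assumption of mine, not a figure from the talk.

```python
# System MTBF under independent, identical parts: rates add, so the
# system MTBF is the per-part MTBF divided by the part count.

def system_mtbf(mtbf_per_part, parts):
    rate = parts / mtbf_per_part   # failures per hour, summed over parts
    return 1.0 / rate

per_die = 100_000.0                # hours, assumed per-die MTBF
for dies in (1, 8, 64):
    print(dies, "dies ->", system_mtbf(per_die, dies), "hours MTBF")
```

An 8-die system drops to 12,500 hours and an 8x8 configuration to about 1,560, which is why very reliable dies are a prerequisite for scaling up die and core counts.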
34 AMD Continues to Lead with Customer-Centric x86 Innovations: The Next Big Things
35 Continuum of Solutions, "Torrenza" to "Fusion": HTX accelerator; PCIe add-in accelerator; chipset-attached accelerator alongside the AMD processor; socket-compatible accelerator ("Stream" general-purpose GPU in an Opteron socket); package-level integration (MCM) of CPU and accelerator; silicon-level integration of CPU, NB, and accelerator: accelerated processors.
36 Redefining Computer Systems Architecture Torrenza & Fusion 36
37 First AMD Fusion Product: an accelerated processor combining CPU and GPU. Fusion vision: create the optimal computing experience for an increasingly mobile, graphics- and media-centric world; deliver improvements in microprocessor performance-per-watt-per-dollar over today's CPU-only architectures; continue to scale x86 by enabling new x86 computing paradigms, classes, and form factors.
38 The Next Big Things. AMD, the company that first brought the x86 industry simultaneous 32-bit/64-bit computing, the integrated memory controller, HyperTransport technology, native multi-core, and the truly open x86 platform, is once again changing the world of computing: Torrenza and Fusion.
39 Backup 39
40 Abstract: Computer systems architecture is going through a major rethinking. Constraints from form factors, to scaling, to power/cooling, to system balance, have overwhelmed current designs. This talk will discuss these reasons, and a few others, that have caused this to happen; what some of the new design ideas and parameters are; how they will manifest themselves in systems in the not-too-distant future; and what more needs to be done before there is a new stable base.
41 Talk Outline Computer Systems Architecture Today s World Form Factors, Scaling, Power/Cooling, System balance, Not So New Ideas Multi-core Accelerators Heterogeneous Issues to be Solved Future Directions 41
42 Delivering more DRAM bandwidth: Concurrency. Independent DRAM controllers: more DRAM banks reduce page conflicts; longer burst length improves command efficiency. Optimized DRAM paging. Re-architect NB for higher BW. Write bursting. DRAM prefetcher. Core prefetchers.
43 Delivering more DRAM bandwidth Independent DRAM controllers Optimized DRAM paging Re-architect NB for higher BW Increase page hits, decrease page conflicts History-based pattern predictor Write bursting DRAM prefetcher Core prefetchers 43
44 Delivering more DRAM bandwidth Independent DRAM controllers Optimized DRAM paging Re-architect NB for higher bw Write bursting Increase buffer sizes Optimize schedulers Ready to support future DRAM technologies DRAM prefetcher Core prefetchers 44
45 Delivering more DRAM bandwidth Independent DRAM controllers Optimized DRAM paging Re-architect NB for higher BW Write bursting Minimize Rd/Wr Turnaround DRAM prefetcher Core prefetchers 45
46 Delivering more DRAM bandwidth Independent DRAM controllers Optimized DRAM paging Re-architect NB for higher BW Write bursting DRAM prefetcher Track positive and negative, unit and nonunit strides Dedicated buffer for prefetched data Aggressively fill idle DRAM cycles Core prefetchers 46
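A stride-detecting prefetcher of the kind described here can be sketched in a few lines. This is a minimal illustration of my own: the two-hit confirmation rule and single-entry state are simplifying assumptions, while the slide's actual prefetcher tracks many streams into a dedicated buffer.

```python
# Minimal sketch of a stride-detecting prefetcher: it tracks positive and
# negative, unit and non-unit strides, and predicts the next address once
# the same stride has been seen twice in a row.

class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.confirmed = False

    def access(self, addr):
        """Observe one miss address; return a prefetch address or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.stride:
                self.confirmed = True          # same stride seen twice: lock on
                prediction = addr + stride     # candidate for the prefetch buffer
            else:
                self.confirmed = False         # stride changed: retrain
            self.stride = stride
        self.last_addr = addr
        return prediction if self.confirmed else None

pf = StridePrefetcher()
for a in (0, 64, 128, 192):
    print(pf.access(a))   # None, None, 192, 256: the +64 stride locks on
```

The same state machine handles a negative stride (e.g. addresses 100, 90, 80 predict 70), matching the slide's "positive and negative, unit and non-unit strides."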
47 Delivering more DRAM bandwidth Independent DRAM controllers Optimized DRAM paging Re-architect NB for higher BW Write bursting DRAM prefetcher Core prefetchers 47 DC Prefetcher fills directly to L1 Cache IC Prefetcher more flexible 2 outstanding requests to any address
More informationXT Node Architecture
XT Node Architecture Let s Review: Dual Core v. Quad Core Core Dual Core 2.6Ghz clock frequency SSE SIMD FPU (2flops/cycle = 5.2GF peak) Cache Hierarchy L1 Dcache/Icache: 64k/core L2 D/I cache: 1M/core
More informationMainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation
Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer
More informationMulti-threading technology and the challenges of meeting performance and power consumption demands for mobile applications
Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory / DRAM SRAM = Static RAM SRAM vs. DRAM As long as power is present, data is retained DRAM = Dynamic RAM If you don t do anything, you lose the data SRAM: 6T per bit
More informationPerformance of Variant Memory Configurations for Cray XT Systems
Performance of Variant Memory Configurations for Cray XT Systems Wayne Joubert, Oak Ridge National Laboratory ABSTRACT: In late 29 NICS will upgrade its 832 socket Cray XT from Barcelona (4 cores/socket)
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationNVIDIA nforce IGP TwinBank Memory Architecture
NVIDIA nforce IGP TwinBank Memory Architecture I. Memory Bandwidth and Capacity There s Never Enough With the recent advances in PC technologies, including high-speed processors, large broadband pipelines,
More informationFour-Socket Server Consolidation Using SQL Server 2008
Four-Socket Server Consolidation Using SQL Server 28 A Dell Technical White Paper Authors Raghunatha M Leena Basanthi K Executive Summary Businesses of all sizes often face challenges with legacy hardware
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationThe Memory Hierarchy 1
The Memory Hierarchy 1 What is a cache? 2 What problem do caches solve? 3 Memory CPU Abstraction: Big array of bytes Memory memory 4 Performance vs 1980 Processor vs Memory Performance Memory is very slow
More informationComputer Systems Laboratory Sungkyunkwan University
I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage
More informationToward a Memory-centric Architecture
Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More informationWhite Paper. First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)
White Paper First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem) Introducing a New Dynamically and Design- Scalable Microarchitecture that Rewrites the Book On Energy Efficiency
More informationOpen Innovation with Power8
2011 IBM Power Systems Technical University October 10-14 Fontainebleau Miami Beach Miami, FL IBM Open Innovation with Power8 Jeffrey Stuecheli Power Processor Development Copyright IBM Corporation 2013
More informationAMD Opteron Processors In the Cloud
AMD Opteron Processors In the Cloud Pat Patla Vice President Product Marketing AMD DID YOU KNOW? By 2020, every byte of data will pass through the cloud *Source IDC 2 AMD Opteron In The Cloud October,
More informationPOWER7: IBM's Next Generation Server Processor
Hot Chips 21 POWER7: IBM's Next Generation Server Processor Ronald Kalla Balaram Sinharoy POWER7 Chief Engineer POWER7 Chief Core Architect Acknowledgment: This material is based upon work supported by
More informationSU Dual and Quad-Core Xeon UP Server
SU4-1300 Dual and Quad-Core Xeon UP Server www.eslim.co.kr Dual and Quad-Core Server Computing Leader!! ESLIM KOREA INC. 1. Overview eslim SU4-1300 The ideal entry-level server Intel Xeon processor 3000/3200
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationThe Computer Revolution. Classes of Computers. Chapter 1
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition 1 Chapter 1 Computer Abstractions and Technology 1 The Computer Revolution Progress in computer technology Underpinned by Moore
More information18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013
18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,
More informationApplication Performance on Dual Processor Cluster Nodes
Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys
More informationThis Unit: Main Memory. Building a Memory System. First Memory System Design. An Example Memory System
This Unit: Main Memory Building a Memory System Application OS Compiler Firmware CPU I/O Memory Digital Circuits Gates & Transistors Memory hierarchy review DRAM technology A few more transistors Organization:
More informationSpring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand
Main Memory & DRAM Nima Honarmand Main Memory Big Picture 1) Last-level cache sends its memory requests to a Memory Controller Over a system bus of other types of interconnect 2) Memory controller translates
More informationAddendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches
Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com
More informationEFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES
EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University
More informationAgenda. Sun s x Sun s x86 Strategy. 2. Sun s x86 Product Portfolio. 3. Virtualization < 1 >
Agenda Sun s x86 1. Sun s x86 Strategy 2. Sun s x86 Product Portfolio 3. Virtualization < 1 > 1. SUN s x86 Strategy Customer Challenges Power and cooling constraints are very real issues Energy costs are
More informationAdvanced Computer Architecture (CS620)
Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).
More informationThe future is parallel but it may not be easy
The future is parallel but it may not be easy Michael J. Flynn Maxeler and Stanford University M. J. Flynn 1 HiPC Dec 07 Outline I The big technology tradeoffs: area, time, power HPC: What s new at the
More informationKey Points. Rotational delay vs seek delay Disks are slow. Techniques for making disks faster. Flash and SSDs
IO 1 Today IO 2 Key Points CPU interface and interaction with IO IO devices The basic structure of the IO system (north bridge, south bridge, etc.) The key advantages of high speed serial lines. The benefits
More informationABySS Performance Benchmark and Profiling. May 2010
ABySS Performance Benchmark and Profiling May 2010 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource - HPC
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationBREAKING THE MEMORY WALL
BREAKING THE MEMORY WALL CS433 Fall 2015 Dimitrios Skarlatos OUTLINE Introduction Current Trends in Computer Architecture 3D Die Stacking The memory Wall Conclusion INTRODUCTION Ideal Scaling of power
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationEPYC VIDEO CUG 2018 MAY 2018
AMD UPDATE CUG 2018 EPYC VIDEO CRAY AND AMD PAST SUCCESS IN HPC AMD IN TOP500 LIST 2002 TO 2011 2011 - AMD IN FASTEST MACHINES IN 11 COUNTRIES ZEN A FRESH APPROACH Designed from the Ground up for Optimal
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationHPC Technology Trends
HPC Technology Trends High Performance Embedded Computing Conference September 18, 2007 David S Scott, Ph.D. Petascale Product Line Architect Digital Enterprise Group Risk Factors Today s s presentations
More informationSAS Enterprise Miner Performance on IBM System p 570. Jan, Hsian-Fen Tsao Brian Porter Harry Seifert. IBM Corporation
SAS Enterprise Miner Performance on IBM System p 570 Jan, 2008 Hsian-Fen Tsao Brian Porter Harry Seifert IBM Corporation Copyright IBM Corporation, 2008. All Rights Reserved. TABLE OF CONTENTS ABSTRACT...3
More informationBuilding blocks for custom HyperTransport solutions
Building blocks for custom HyperTransport solutions Holger Fröning 2 nd Symposium of the HyperTransport Center of Excellence Feb. 11-12 th 2009, Mannheim, Germany Motivation Back in 2005: Quite some experience
More information10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache
Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is
More informationA unified multicore programming model
A unified multicore programming model Simplifying multicore migration By Sven Brehmer Abstract There are a number of different multicore architectures and programming models available, making it challenging
More informationPUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES
PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES Greg Hankins APRICOT 2012 2012 Brocade Communications Systems, Inc. 2012/02/28 Lookup Capacity and Forwarding
More informationIntroduction to OpenMP. Lecture 10: Caches
Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for
More informationMulti-Core Microprocessor Chips: Motivation & Challenges
Multi-Core Microprocessor Chips: Motivation & Challenges Dileep Bhandarkar, Ph. D. Architect at Large DEG Architecture & Planning Digital Enterprise Group Intel Corporation October 2005 Copyright 2005
More informationRobert Jamieson. Robs Techie PP Everything in this presentation is at your own risk!
Robert Jamieson Robs Techie PP Everything in this presentation is at your own risk! PC s Today Basic Setup Hardware pointers PCI Express How will it effect you Basic Machine Setup Set the swap space Min
More informationI/O Channels. RAM size. Chipsets. Cluster Computing Paul A. Farrell 9/8/2011. Memory (RAM) Dept of Computer Science Kent State University 1
Memory (RAM) Standard Industry Memory Module (SIMM) RDRAM and SDRAM Access to RAM is extremely slow compared to the speed of the processor Memory busses (front side busses FSB) run at 100MHz to 800MHz
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it!
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Memory Computer Technology
More informationTHE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research
THE PATH TO EXASCALE COMPUTING Bill Dally Chief Scientist and Senior Vice President of Research The Goal: Sustained ExaFLOPs on problems of interest 2 Exascale Challenges Energy efficiency Programmability
More informationAmdahl's Law in the Multicore Era
Amdahl's Law in the Multicore Era Explain intuitively why in the asymmetric model, the speedup actually decreases past a certain point of increasing r. The limiting factor of these improved equations and
More informationECE 172 Digital Systems. Chapter 15 Turbo Boost Technology. Herbert G. Mayer, PSU Status 8/13/2018
ECE 172 Digital Systems Chapter 15 Turbo Boost Technology Herbert G. Mayer, PSU Status 8/13/2018 1 Syllabus l Introduction l Speedup Parameters l Definitions l Turbo Boost l Turbo Boost, Actual Performance
More informationTransistors and Wires
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis Part II These slides are based on the slides provided by the publisher. The slides
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationMemory: Past, Present and Future Trends Paolo Faraboschi
Memory: Past, Present and Future Trends Paolo Faraboschi Fellow, Hewlett Packard Labs Systems Research Lab Quiz ( Excerpt from Intel Developer Forum Keynote 2000 ) ANDREW GROVE: is there a role for more
More informationCOSC 6385 Computer Architecture - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationHardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.
Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More information