Commercially Available Chip Multiprocessors for Research. Welcome to the Multi-core Era
- Dina Russell
- 5 years ago
Transcription
4/2/11
Commercially Available Chip Multiprocessors for Research
Bruce Childers, University of Pittsburgh
http:// AAO http:// http:// team.org http://

Welcome to the Multi-core Era
Chip multiprocessors are everywhere!
- Cellular phones
- Tablets
- Netbooks
- Laptops
- Desktops
- Servers
Welcome to the Multi-core Era: Qualcomm MSM8660
- Dual 1.5 GHz Scorpion
- GPU & cellular modem
- Up to 10 giga operations per second
- App + media + radio operation
- Increasing by 10x every 5 years
- 1W available (from total) for computing
- Battery power determines limits
- May be single to multiple chips
[Diagram: processor with modem (DMA), GPU (OpenGL), and two cores (ARM9)]

Welcome to the Multi-core Era: Intel Sandy Bridge NB
- 4 cores, 8 HW threads
- Integrated GPU, MC
- Powerful applications from consumer to science to business
- Single processor ("socket")
- Moving toward high integration
- Moving more toward heterogeneous
[Diagram: processor with four cores (x86), GPU, and MC (DDR3)]
Welcome to the Multi-core Era: AMD Opteron
- AMD Opteron cores, 6MB L3
- 4 sockets, HyperTransport
- Range of services, e.g., cloud computing
- Virtualization for server consolidation
- Power consumption (effective utilization)
- Multiple cores per processor
- Multiple processors per machine (node)
- Multiple machines per cabinet
[Diagram: processor ("socket") with x86 cores, interconnect (HT 3.1), and MC (DDR3)]

Important Attributes
Core architecture
- Core
- Nearby caches (L1, L2)
Uncore architecture
- Last-level cache (L3)
- Memory
- Power management
- Interconnection
- Graphics processing
The uncore is what can matter for multi-core.
It may also soon be the graphics processing capabilities.
Intel Processors
Timeline: NetBurst; Core; Nehalem: Nehalem (45nm), Westmere (32nm), Westmere-EX; Sandy Bridge: Sandy Bridge (32nm), Ivy Bridge (22nm)
- First Intel dual core
- Nehalem: cores, HT, loosely integrated MC/GPU
- Westmere: 6 cores, HT, loosely integrated MC/GPU, VM
- Westmere-EX: server variant, 10 cores, 4 processors (QPI), 2011?
- Sandy Bridge: 6 cores, new uarch, closely integrated GPU & MC

Sandy Bridge
- Desktop, mobile & server variants
Features
- Enhanced core microarchitecture
- More closely coupled & integrated components
- Hyper-threading with up to 8 cores (16 threads)
- On-chip shared L3 cache
- Turbo Boost power/speed management
- Later server versions will feature improved QuickPath Interconnect
Intel Sandy Bridge
[Diagram: cores, GPU, cache, north bridge (PCIe x16, display), DMI, south bridge]
- 4 cores with L1, L2, L3 cache
- Hyper-threaded: 8 logical cores
- Advanced vector extensions (256-bit SIMD)
- Microarchitecture changes (improved branch predictor, changed register renaming for AVX, 2x load ports)

Intel Sandy Bridge: L1 instruction cache
- 32KB L1 I-cache
- Decode 4 x86 instr/cycle
- Converted to uops
- 1.5K-entry (L0) uop cache (just caches, not a trace cache)
- Gain is power
Intel Sandy Bridge
- 32KB L1 data cache
- 256KB L2 cache (unified, private)

Intel Sandy Bridge
- 8MB L3 cache (shared)
- Designed for high bandwidth
- Shared by cores + GPU
- 435 GB/sec at 3.4 GHz*
* Source: Sandy Bridge Spans Generations, Linley Gwennap, MPR, Sept. 2010
[Diagram: Sandy Bridge L3 cache layout. System agent (memory controller, PCIe, display), four cores each paired with a 2 MB L3 cache slice, and the graphics processing unit.]
L3 cache
- Composed of 4 rings: 32-byte data, request, acknowledgement, snooping
- Up to clock traversal
- Distributed coherence

Intel Sandy Bridge: graphics processing unit
- Integrated on chip
- More closely coupled with cores (via L3 cache)
- New FUs & video codec
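The 435 GB/sec figure quoted for the shared L3 at 3.4 GHz can be cross-checked with a back-of-the-envelope calculation. The sketch below assumes, and this is an assumption rather than something the slides state, that each of the four 2 MB L3 slices can move one 32-byte ring transfer per clock. The function name is hypothetical.

```python
def l3_peak_bandwidth_gb_per_s(slices: int, bytes_per_cycle: int, clock_ghz: float) -> float:
    """Aggregate peak bandwidth in GB/s.

    Idealized assumption: every slice moves bytes_per_cycle bytes on every
    clock, with no contention on the ring.
    """
    return slices * bytes_per_cycle * clock_ghz


# Values taken from the slides: 4 slices, 32-byte data ring, 3.4 GHz.
peak = l3_peak_bandwidth_gb_per_s(slices=4, bytes_per_cycle=32, clock_ghz=3.4)
print(f"{peak:.1f} GB/s")
```

Under that assumption the result lands on the quoted number, which suggests the figure is an aggregate peak rather than what a single core would observe.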
Intel Sandy Bridge: uncore
- Uncore logic to connect to memory, display and I/O
- Dual-channel memory

Intel Sandy Bridge: Platform Controller Hub (PCH)
- Connects to I/O devices
- E.g., SATA disk, USB, PCI Express, etc.
Turbo Boost Power Management
- Thermal design point (TDP): maximum power dissipated
- Baseline: consider impact of all cores
- But not all cores are always active
- Change power allocation
- Introduced in Nehalem
- Shifts available budget (under TDP) to boost speed of cores based on workload

Feedback Loop
- Monitoring
- Adjust voltage, frequency
- Boost to remain under TDP
- May temporarily exceed TDP
[Diagram: core state (active, inactive), OS state change, temperature, power, and estimated current feed the power manager, which sets the speed setting of the cores]
Feedback Loop: baseline frequency
- Cores are active/inactive
- Frequency with four cores
- OS state change trigger
- Speed setting shown in diagram: 200

Feedback Loop: inactive cores
- Cores moved to inactive (C3/C6)
- Leaves headroom in TDP
- Spend on other cores
- Power manager: boost cores
Feedback Loop: adjust speed upward
- Change in small steps (10MHz)
- Up to maximum speed
- Stay under TDP
- Power manager: boost cores
- Speed setting shown in diagram: 210, then 220
Feedback Loop: adjust speed downward
- Move back under TDP
- Temporarily exceeds b/c thermals change slowly
- Power manager: reduce cores
- Speed setting shown in diagram: 230, then 220
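The feedback loop described on these slides can be sketched as a toy simulation. This is a minimal illustration, not Intel's actual power-management algorithm: the linear power model, the function names, and every numeric default (step size, TDP, watts per MHz) are hypothetical placeholders chosen only to make the behavior visible.

```python
def turbo_step(freq_mhz, power_w, tdp_w, step_mhz, base_mhz, max_mhz):
    """One iteration of a toy feedback loop: raise the speed setting by one
    step while there is headroom under TDP, lower it when power exceeds TDP."""
    if power_w > tdp_w:
        return max(base_mhz, freq_mhz - step_mhz)
    if power_w < tdp_w:
        return min(max_mhz, freq_mhz + step_mhz)
    return freq_mhz


def estimated_power_w(freq_mhz, active_cores, watts_per_core_mhz):
    """Hypothetical linear power model. Real parts also depend on voltage,
    temperature, and workload."""
    return active_cores * watts_per_core_mhz * freq_mhz


def simulate(active_cores, steps=10, base_mhz=2500, max_mhz=3500,
             step_mhz=100, tdp_w=55.0, watts_per_core_mhz=0.005):
    """Return the sequence of speed settings chosen by the loop."""
    freq = base_mhz
    trace = [freq]
    for _ in range(steps):
        power = estimated_power_w(freq, active_cores, watts_per_core_mhz)
        freq = turbo_step(freq, power, tdp_w, step_mhz, base_mhz, max_mhz)
        trace.append(freq)
    return trace


print(simulate(active_cores=4, steps=6))   # all cores active: little headroom
print(simulate(active_cores=1, steps=12))  # one core active: boosts to max_mhz
```

With four active cores the toy loop climbs until it briefly overshoots the budget and then steps back down, mirroring the "may temporarily exceed TDP" behavior. With one active core the unused budget lets the setting rise to the maximum.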
Feedback Loop: Core i7 2920XM, 4 cores, base 2.5 GHz
- Max speed by number of active cores: 3.2 GHz, 3.3 GHz, 3.4 GHz, 3.5 GHz

AMD Processors
Timeline, November 2008 onward: Shanghai (2008), Istanbul (2009), Magny-Cours (2010), Bulldozer (2011?)
- Shanghai: 4 cores (no HWT), 2.9 GHz, 45nm, 6MB L3, DDR2
- Istanbul: 6 cores, 2.8 GHz, 45nm, 6MB L3, DDR2, HT assist
- Magny-Cours: 12 cores, 2.6 GHz, 45nm, 12MB L3, DDR3, HT assist
- Bulldozer: tightly coupled cores, separate sched & FUs (HWT-like), 16 cores, 32nm, 16 MB L3, 256-bit FPU
AMD Opteron x86 processor, Istanbul core architecture
Per package (Multi-chip Module)
- 12 Istanbul cores
- 2 dies (nodes), 6 cores each, 45nm
Per node
- 6 MB shared L3
- Memory controller
- 2x memory channels
- 4x HyperTransport links
Non-uniform memory access
- Shared address space
- Physically distributed to nodes
- Transparently access any address
- Local address: faster access
- Remote address: slower (going across the interconnect)

HyperTransport Links
- HyperTransport: point-to-point interconnect (LVDS)
- Arranged as multiple links (e.g., x16 links)
- Up to 25.6 GB/second (x32 links)
- 4 x16 HT ports/processor allocated for within-package communication, cross-processor communication & I/O
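The 25.6 GB/second figure for an x32 link follows from link width times transfer rate. The sketch below assumes the HyperTransport 3.1 peak rate of 6.4 GT/s per lane (a 3.2 GHz clock with double data rate); that rate comes from the HT 3.1 specification rather than from the slides, and the result counts one direction only. The function name is hypothetical.

```python
def ht_link_bandwidth_gb_per_s(width_bits: int, gigatransfers_per_s: float = 6.4) -> float:
    """Peak one-way bandwidth of a HyperTransport link in GB/s.

    Assumes every transfer carries width_bits of payload, i.e. protocol
    overhead is ignored.
    """
    return width_bits / 8 * gigatransfers_per_s


for width in (8, 16, 32):
    print(f"x{width}: {ht_link_bandwidth_gb_per_s(width):.1f} GB/s")
```

Under the same assumption, the x16 and x8 links used between nodes would peak at half and a quarter of the x32 figure respectively.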
Interconnection (2 processors)
- 4 x16 HyperTransport links
- x16 adjacent off-package nodes
- x8 diagonal off-package nodes
- x16 + x8 on-package nodes
- x16 noncoherent I/O
[Diagram: nodes P0 to P3 connected by x16 and x8 coherent HT (cHT) links, with x16 noncoherent HT (ncHT) links to I/O]

Interconnection (4 processors)
- 4 x16 HyperTransport links
- x8 between off-package nodes
- x16 + x8 on-package nodes
- x16 noncoherent I/O
[Diagram: eight nodes, two per processor]
Coherence Traffic
- Explosion in coherence traffic
- 4 processors, 48 cores!

Coherence
- Data may reside in multiple caches
- Need to keep it consistent
- Single writer, multiple readers
- Broadcast: request which core has most recent data
- Clearly, doesn't scale well

HT Assist
- Home is location where memory address resides
- Data can be cached anywhere, though
- Need to find the location
- Reader: deliver potentially most recent copy
- Writer: get exclusive ownership to update data
HT Assist
1) P2 requests data from P0 (home)
2) P0 broadcasts for most recent copy
3) Wait for reply from each processor
4) Data forwarded from P1 to P2
[Animation, steps 1 to 4: each processor replies to the broadcast; the one holding the most recent copy answers "Yes!"]
HT Assist
Without a directory:
1) P2 requests data from P0 (home)
2) P0 broadcasts for most recent copy
3) Wait for reply from each processor
4) Data forwarded from P1 to P2

- Maintain directory of data location
- 1MB of L3 dedicated to directory
- Reduces traffic (location known)

With a directory entry (Proc. X: 1):
1) P2 requests data from P0 (home)
2) P0 broadcasts for most recent copy
3) Data forwarded from P1 to P2
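The traffic saving can be illustrated with a simplified message count. This sketch is not the actual HT Assist protocol: it assumes a broadcast sends one probe to every other node and collects one reply from each, while a directory hit sends a single directed probe to the owner, which forwards the data. The function names are hypothetical, and the 8-node example assumes 4 processors with 2 dies (nodes) each.

```python
def probe_traffic_broadcast(nodes: int) -> int:
    """Probes plus replies when the home node asks every other node."""
    probes = nodes - 1
    replies = nodes - 1
    return probes + replies


def probe_traffic_directory() -> int:
    """One directed probe to the recorded owner plus one data forward."""
    return 1 + 1


for nodes in (2, 4, 8):
    print(nodes, probe_traffic_broadcast(nodes), probe_traffic_directory())
```

In this model the broadcast cost grows linearly with the node count while the directory cost stays constant, which is consistent with the slide's remark that the scheme only pays off beyond two nodes.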
HT Assist
- Only makes sense for >2 nodes
- Avoids most broadcasts
- Reduces L3 cache capacity

HT Assist: Where to Keep Directory?
- 6 MB L3 cache with 16 ways
- 64-byte line with 16 directory entries
- 4-byte directory entry (probe filter): tag, state, owner
- Same processor in 1P and 4P systems
- Reduce costs by reusing the L3 cache for directory
- 16 ways, 4 ways dedicated to directory
- Sparse directory structure
Maintain coherence state
- Modified (owner, dirty)
- Owned (owner, with sharers)
- Exclusive (one owner, consistent)
- Shared (shared, clean/dirty)
- Invalid (identified by lack of entry)
Source: HotChips
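The directory figures on these slides can be combined into a rough capacity estimate. The sketch assumes that 1 MB means 2^20 bytes and that each 4-byte entry tracks exactly one 64-byte cache line; both are assumptions made for illustration, and the variable names are hypothetical.

```python
LINE_BYTES = 64                    # L3 line size from the slide
ENTRY_BYTES = 4                    # probe-filter entry size from the slide
DIRECTORY_BYTES = 1 * 1024 * 1024  # 1 MB of L3 dedicated to the directory

# How many directory entries fit in one L3 line.
entries_per_line = LINE_BYTES // ENTRY_BYTES

# Total entries the directory can hold.
entries = DIRECTORY_BYTES // ENTRY_BYTES

# Amount of cached data those entries can track, in MiB,
# assuming one entry per 64-byte line.
covered_mib = entries * LINE_BYTES // (1024 * 1024)

print(entries_per_line, entries, covered_mib)
```

The first value matches the "16 directory entries" per line stated on the slide. The coverage estimate shows why the directory is sparse: it tracks only lines that are currently cached, not all of memory.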
AMD Opteron 6100
Columns: Model, Speed, Cores, ACP, TDP, Price
618SE 2.5 GHz W 14W $1514 * GHz 12 8W 115 W $1265 * GHz 12 8W 115 W $ GHz 12 8W 115 W $ GHz 8 8W 115 W $ GHz 8 8W 115 W $ HE 1.8 GHz W 85 W $ HE 1.7 GHz W 85 W $ HE 2.2 GHz 8 65 W 85 W $ HE 1.8 GHz 8 65 W 85 W $455
- SE: optimized for performance
- HE: optimized for low power
- ACP: average CPU power (workload-derived power)
- All have 12 MB L3 (2x 6 MB), HT3, AMD-V
- Introduced March 29, 2010
- * Introduced February 14, 2011

What's available?
- AVA Direct Supermicro SuperServer, $5087: Quad AMD Opteron core 2.GHz (32), 64 GB memory, 50GB SATA drive
- Dell PowerEdge R415, $2457: Dual AMD Opteron 4170HE (6), 2.1 GHz (12), 16 GB memory, 25GB SATA drive
- Dell XPS 830 (desktop), $1453: Intel Core i7 260 (8MB, 3.4 GHz), 16 GB memory, 1TB SATA drive
Summary
Multi-core is certainly here!
Significant research challenges
- Platform infrastructure
- Core architecture
- Cache architecture
- Interconnection
- Power management
- Integration and fusing of CPU+GPU
Today's processors offer many of these capabilities for research!
More informationOverview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware
Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and
More informationOverview: Shared Memory Hardware
Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing
More informationPOWER YOUR CREATIVITY WITH THE INTEL CORE X-SERIES PROCESSOR FAMILY
Product Brief POWER YOUR CREATIVITY WITH THE INTEL CORE X-SERIES PROCESSOR FAMILY The Ultimate Creator PC Platform Made to create, the latest X-series processor family is powered by up to 18 cores and
More informationPC I/O. May 7, Howard Huang 1
PC I/O Today wraps up the I/O material with a little bit about PC I/O systems. Internal buses like PCI and ISA are critical. External buses like USB and Firewire are becoming more important. Today also
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More informationHyperTransport Technology
HyperTransport Technology in 2009 and Beyond Mike Uhler VP, Accelerated Computing, AMD President, HyperTransport Consortium February 11, 2009 Agenda AMD Roadmap Update Torrenza, Fusion, Stream Computing
More informationThe Next Revolution in Computer Systems Architecture
The Next Revolution in Computer Systems Architecture Richard Oehler Corporate Fellow Office of the CTO University of Mannheim 2/08/07 Computer Systems Architecture Not just the Processor Chip It s all
More informationDART- CUDA: A PGAS RunAme System for MulA- GPU Systems
DART- CUDA: A PGAS RunAme System for MulA- GPU Systems Lei Zhou, Karl Fürlinger presented by Ma#hias Maiterth Ludwig- Maximilians- Universität München (LMU) Munich Network Management Team (MNM) InsAtute
More informationCSC501 Operating Systems Principles. OS Structure
CSC501 Operating Systems Principles OS Structure 1 Announcements q TA s office hour has changed Q Thursday 1:30pm 3:00pm, MRC-409C Q Or email: awang@ncsu.edu q From department: No audit allowed 2 Last
More informationFAST FORWARD TO YOUR <NEXT> CREATION
FAST FORWARD TO YOUR CREATION THE ULTIMATE PROFESSIONAL WORKSTATIONS POWERED BY INTEL XEON PROCESSORS 7 SEPTEMBER 2017 WHAT S NEW INTRODUCING THE NEW INTEL XEON SCALABLE PROCESSOR BREAKTHROUGH PERFORMANCE
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationEITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor
EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration I/O MultiProcessor Summary 2 Virtual memory benifits Using physical memory efficiently
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationCopyright 2014 Splunk Inc. Splunk for VMware. Architecture & Design. Michael Donnelly, Sr. Sales Engineer
Copyright 2014 Splunk Inc. Splunk for VMware Architecture & Design Michael Donnelly, Sr. Sales Engineer Disclaimer During the course of this presentaeon, we may make forward looking statements regarding
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationBREAKING THE MEMORY WALL
BREAKING THE MEMORY WALL CS433 Fall 2015 Dimitrios Skarlatos OUTLINE Introduction Current Trends in Computer Architecture 3D Die Stacking The memory Wall Conclusion INTRODUCTION Ideal Scaling of power
More informationServer Sizing Joe Chang qdpma.com Jchang6 at yahoo
Server Sizing 2018 Joe Chang qdpma.com Jchang6 at yahoo About Joe SQL Server consultant since 1999 Query Optimizer execution plan cost formulas (2002) True cost structure of SQL plan operations (2003?)
More informationIP Device Integration Notes
IP Device Integration Notes Article ID: V1-15-01-20-t Release Date: 01/20/2015 Applied to GV-VMS V14.10 Summary The document consists of three sections: 1. The total frame rate and the number of channels
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationScalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels
More informationVector Engine Processor of SX-Aurora TSUBASA
Vector Engine Processor of SX-Aurora TSUBASA Shintaro Momose, Ph.D., NEC Deutschland GmbH 9 th October, 2018 WSSP 1 NEC Corporation 2018 Contents 1) Introduction 2) VE Processor Architecture 3) Performance
More informationWhite Paper. First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)
White Paper First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem) Introducing a New Dynamically and Design- Scalable Microarchitecture that Rewrites the Book On Energy Efficiency
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (III)
OS 6385 omputer Architecture - Thread Level Parallelism (III) Edgar Gabriel Spring 2018 Some slides are based on a lecture by David uller, University of alifornia, Berkley http://www.eecs.berkeley.edu/~culler/courses/cs252-s05
More informationKey Points. Rotational delay vs seek delay Disks are slow. Techniques for making disks faster. Flash and SSDs
IO 1 Today IO 2 Key Points CPU interface and interaction with IO IO devices The basic structure of the IO system (north bridge, south bridge, etc.) The key advantages of high speed serial lines. The benefits
More information