Hardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.
|
|
- Silvester Ball
- 5 years ago
- Views:
Transcription
1 Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc.
2 Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT Aim of Victoria Falls Hardware scaling Data Scaling the OS Virtualization and Consolidation Scaling Applications Conclusions Page 2
3 Memory Bottleneck Relative Performance CPU Frequency DRAM Speeds CPU -- 2x Every 2 Years DRAM -- 2x Every 6 Years Gap Source: Sun World Wide Analyst Conference Feb. 25, 2003
4 CMT Implementation Four threads share a single pipeline Every cpu cycle an instruction from a different thread is executed Thread 4 Thread 3 Thread 2 Thread 1 C C C M C M M C M C C M C M M C M C C M C M M M Niagara Processor Shared Pipeline Utilization: Up to 85% Memory Latency Compute Time Page 4
5 Lessons from CMT Multiple cores are here to stay. Already quad core, 8 core on most roadmaps Big impact on commercial applications, Big problem for desktop and laptop software The pursuit of frequency is over. Apart from one or two vendors most have dropped their 4GHz and 5GHz plans Very high frequency usually means higher power Higher frequency usually adds complexity and lengthens time to market
6 Lessons learned from CMT Power and cooling have become extremely important to most customers Ecoresponsibility and carbon footprint are the new standards for datacenters Many datacenters are at capacity Acceleration in the development of lower power systems Also accelerated Virtualization technology. Customer trading performance versus power
7 Lessons learned from CMT Many throughput workloads benefit more from higher thread count than higher frequency CMT has proven this time and again with public benchmarks Some code is single threaded and we need to work to make them parallel Not as big an issue in the HPC space but even here can have periods such as results aggregation and report generation that are serial
8 Big Question How do you make CMT Scale to 32 cores and 256 threads Page 8
9 Aim of Victoria Falls Create an SMP version of CMT to extend the highly threaded Niagara design Use T2 as the basis for these systems Minimal modifications to T2 for shorter time to market Data Create two-way and four-way systems without the need for a traditional SMP backplane Avoid any hardware bottlenecks to scaling High throughput low latency interconnect Scale memory bandwidth with processors Scale I/O bandwidth with processors Hardware features to enable software scaling Page 9
10 Victoria Falls Processor VF is a derivative. Re-use UltraSPARC T2 cores, crossbar, L2$, PCI-Express I/F Remove 10GE network interface Replace two memory channels with four coherence channels Increase memory link speed to 4.8Gbps (was 4.0Gbps) Added relaxed DMA ordering to PCI- Ex interface All I/O on VF via the x8 PCI-E link per chip, scale I/O with processors Similar packaging as US T2 4 Coherence Channels 6.4 GB/s per channel each direction, 51GB/s total Dual Channel Coherence Unit FBDIMM Memory Controller Coherence Unit PCI-Express 2.5Ghz 2 GB/s each direction Dual Channel FBDIMM Controller Coherence Coherence Unit Memory Unit Niagara2 Cores, Crossbar, L2$ (8 cores, 64 threads, 4MB L2$) NCU, DMU 25 GB/s read 13 GB/s write NCX Page 10
11 Hardware Scaling - Interconnect Interconnect is key to scaling to 32 cores and 256 threads Decision not to use a traditional bus given the memory bandwidth requirements Instead we used half the FBDIMM channels for memory Data and the other half to create 4 bidirectional links for interconnect Physical link of interconnect same as FBDIMM Reuse the FBDIMM pins for the interconnect Result is a high speed SERDES link Provides high bandwidth and a low latency interconnect Page 11
12 VF 2-Socket System Dual Channel FBDIMM Dual Channel FBDIMM Dual Channel FBDIMM Dual Channel FBDIMM Memory Controller Memory Controller Memory Controller Memory Controller Coherence Unit Coherence Unit Coherence Unit Coherence Unit Coherence Unit Coherence Unit Coherence Unit Coherence Unit Niagara2 Cores, Crossbar, L2$ (8 cores, 64 threads, 4MB L2$) Niagara2 Cores, Crossbar, L2$ (8 cores, 64 threads, 4MB L2$) PCI-Express NCU, DMU NCX PCI-Express NCU, DMU NCX System IO (Network, Disk, etc.)
13 VF 4-Socket System PCI-Express VF FBDIMM Dual Channel Dual Channel FBDIMM VF PCI-Express 8 Cores 8 Cores 64 Threads 4MB L2$ FBDIMM Dual Channel Coherence Hubs (4) Dual Channel FBDIMM 64 Threads 4MB L2$ 4 ports/hub PCI-Express VF VF FBDIMM Dual Channel 6.4GB/s per port in each dir Dual Channel FBDIMM VF PCI-Express 8 Cores 8 Cores 8 Threads/Core 64 4MB 4MB L2$ L2$ FBDIMM Dual Channel Dual Channel FBDIMM 8 Cores 64 Threads 4MB L2$
14 Hardware Scaling - Memory Two elements are key to scaling on highly threaded servers Memory bandwidth. Hundreds of threads require a very wide pipe to memory Local vs. Remote memory latency. Systems that are highly NUMA Data can show inverse scaling if memory is placed badly Each VF processor has its own local memory connected delivering high memory bandwidth > 21GB/sec Read and 10GB/sec Write > Memory bandwidth scales with the number of processors Remote memory latency has a 76ns penalty for access, about 1.5x latency to local memory This makes VF systems NUMA although not highly so. Page 14
15 Software Scaling - scaling the OS There are a set of common issues when scaling the OS on a Multi-core highly threaded system Scaling single threaded kernel services Scaling contended mutexes Optimising memory placement Data Optimal scheduling across cores Scaling I/O Local vs. remote Delivering interrupts Virtualization Page 15
16 Single threaded kernel services Prime example of a single threaded kernel service is when Solaris performs accounting and book keeping activities every clock tick. As the number of CPUs increases, the tick accounting loop gets larger. On a 4 way VF system there are 256 threads to check Data Tick accounting is a single threaded activity and on a busy system with many CPUs, the loop can often take more than a tick to process if the locks it needs to acquire are busy. This causes clock drift and other unpleasantness. Solution is to involve multiple cpus in the Tick scheduling. Cpus are collected in groups and one is chosen to schedule all their ticks Algorithm is used to spread the tick accounting evenly across active cpus. Page 16
17 Scaling contended mutexes mutexes in Solaris are optimised for the non contended case. Contended mutexes are considered rare With up to 256 active threads in a single OS instance, calls such as atomic_add_32() on contended mutexes can become much hotter. Data Solaris solution is an Exponential backoff algorithm. This has the advantage of little overhead in the noncontended case and performance gains in the contended case Page 17
18 NUMA aware OS As we have seen VF systems are the first NUMA CMT servers A NUMA aware OS ultimately helps in scaling an application as it reduces dramatically the interconnect traffic between chips in the system Data Solaris has been NUMA aware since Solaris 9 Processes are assigned to a local lgroup where hot areas of their address space such as heap and stack are allocated and where they will be scheduled The effect is that there is an increase in accesses to lower latency local memory Page 18
19 The Hypervisor: Virtualization for CMT Platforms sun4v architecture Operating System Solaris X update (genunix) sun4u code Solaris X (genunix) Solaris X (sun4v) US-Z CPU code sun4v interface CPU Z SPARC hypervisor SPARC CPU Platform Page 19
20 The Sun4v/HV/LDOMs Model Replace HW domains with Logical Domains > Highly flexible Each Domain runs an independent OS > Capability for specialized domains Logical Domain 1 Service Domain Logical Domain 2 Solaris 10 App App Container Logical Domain 3 Open Solaris App App Container 1 App App Container 2 Hypervisor Hardware Shared CPU, Memory, IO I/O CPU Mem CPU Mem CPU Mem CPU Page 20
21 Virtualized I/O Logical Domain A Service Domain App App App App Device Driver /pci@b/qlc@6 Privileged Virtual Device Driver Virtual Device Service Nexus Driver /pci@b Hyper Privileged Hypervisor Domain Channel Virtual Nexus I/F Hardware I/O MMU PCI Root I/O Bridge PCI B Page 21
22 The future - Hybrid I/O Logical Domain A Service Domain App App App App Common Device Driver Privileged Device Driver Common Device Service Nexus Driver /pci@b Hyper Privileged Hypervisor Domain Channel Virtual Nexus I/F Hardware I/O MMU PCI Root I/O Bridge PCI B Page 22
23 Scaling Applications on many cores Scaling and throughput are the most important goals for performance on highly threaded systems High Utilisation of the pipelines is key to performance. Need many software threads or processes to utilise the hardware Datathreads. Threads have to be actually working Multiple instances of an application may be necessary to achieve scaling A single thread can become the bottleneck for the application Spinning in a tight loop waiting for a hot lock is not good for scaling on CMT Page 23
24 Utilization and Corestat In order to determine the utilization we use low level hardware counters to count the number of instructions per second per pipeline both integer and floating point We have written a tool called corestat which collects data from the low level performance counters and aggregates it to present utilization percentages Available from Corestat reports : Individual integer pipeline utilization (Default mode) Floating Point Unit utilization -g flag) Page 24
25 Locking issues The most common reason an application fails to scale on CMT is hot locks We have found this especially true when migrating from 2-4 way systems where access to hot locks and structures is effectively serialized If a system has 16 cores and 128 threads, then there can be up to 128 instruction streams executing in parallel. Hot locks which will tend to reside in the L2 cache and can become extremely contended. When developing locking code we need to use a low-impact, long-latency opcode to add delay in the busy-wait loop. The low-impact opcode frees up cycles so that other threads sharing the core get more useful work done Page 25
26 Proven Applications Areas for CMT Web-servers: Apache, SunOne, Lighttpd Web Proxy: Sun Web Proxy J2SE Appservers: BEA WLS, Websphere, SunOne, Oracle Database Servers: Oracle, Sybase, mysql, Postgres, DB2 Mail Servers: Sendmail, Domino, Brightmail, Openwave ERP: Siebel, Peoplesoft Oracle E-Business Security Page 26
27 Proven Applications Areas for CMT JES Stack: Directory, Portal, Access Manager NFS Servers DNS: BIND E-Learning: Blackboard Capacity Planing: BMC, HP, Tivoli Net Backup: Search: Lucene Development: Clearcase, compile servers Streaming Video Page 27
28 VF Performance Scaling Internal OLTP HPC Matrix Java Business CPU Intensive Single-chip 1.4Ghz Victoria Falls, 8 Cores, 64 Threads, Solaris Dual-chip 1.4Ghz Victoria Falls, 16 Cores, 128 Threads, Solaris Page 28
29 VF Availability Victoria Falls 2-socket Niagara systems Available today Victoria Falls 2-socket Niagara blades 2HCY08 Victoria Falls 4-socket Niagara systems 1HCY08 Data Page 29
30 Conclusions Multiple cores are here to stay and the pursuit of frequency is over Power and cooling are the new driving factors in the datacenter. This requires much higher utilization through scaling, consolidation and virtualization A high bandwidth, Data low latency interconnect is essential to achieve good scalability on multi-core systems Scalable memory bandwidth and physical I/O is also required An OS running on a highly threaded processor must also be scalable and in many cases NUMA aware. It must also be virtualizable Servers with many cores and hardware threads present a different set of scaling issues to applications Page 30
Logical Domains (LDoms)
Logical Domains (LDoms) Liam Merwick LDoms Developer Sun Microsystems SPARC Platform S/W Group Contents Background Features Architecture Components Future Roadmap Configuration Examples Q & A Page: 2 Background
More informationAdapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES
Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Tom Atwood Business Development Manager Sun Microsystems, Inc. Takeaways Understand the technical differences between
More informationOverview of the SPARC Enterprise Servers
Overview of the SPARC Enterprise Servers SPARC Enterprise Technologies for the Datacenter Ideal for Enterprise Application Deployments System Overview Virtualization technologies > Maximize system utilization
More informationSolaris Engineered Systems
Solaris Engineered Systems SPARC SuperCluster Introduction Andy Harrison andy.harrison@oracle.com Engineered Systems, Revenue Product Engineering The following is intended to outline
More informationVirtual Leverage: Server Consolidation in Open Source Environments. Margaret Lewis Commercial Software Strategist AMD
Virtual Leverage: Server Consolidation in Open Source Environments Margaret Lewis Commercial Software Strategist AMD What Is Virtualization? Abstraction of Hardware Components Virtual Memory Virtual Volume
More informationLeveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD
Leveraging OpenSPARC ESA Round Table 2006 on Next Generation Microprocessors for Space Applications G.Furano, L.Messina TEC- OpenSPARC T1 The T1 is a new-from-the-ground-up SPARC microprocessor implementation
More informationOracle Performance on M5000 with F20 Flash Cache. Benchmark Report September 2011
Oracle Performance on M5000 with F20 Flash Cache Benchmark Report September 2011 Contents 1 About Benchware 2 Flash Cache Technology 3 Storage Performance Tests 4 Conclusion copyright 2011 by benchware.ch
More informationPERFORMANCE CONSIDERATIONS FOR DEVELOPERS UTILIZING SUN SPARC ENTERPRISE M-SERIES SERVERS
July 2008 PERFORMANCE CONSIDERATIONS FOR DEVELOPERS UTILIZING SUN SPARC ENTERPRISE M-SERIES SERVERS Important note: this paper was originally published before the acquisition of Sun Microsystems by Oracle
More informationPower your planet. Optimizing the Enterprise Data Center POWER7 Powers a Smarter Infrastructure
Power your planet. Optimizing the Enterprise Data Center POWER7 Powers a Smarter Infrastructure Enoch Lau Field Technical Sales Specialist, Power Systems Systems & Technology Group Power your planet. Smarter
More informationThe Missing Piece of Virtualization. I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers
The Missing Piece of Virtualization I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers Agenda 10 GbE Adapters Built for Virtualization I/O Throughput: Virtual & Non-Virtual Servers Case
More informationNUMA-aware OpenMP Programming
NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC
More informationSun Fire V880 System Architecture. Sun Microsystems Product & Technology Group SE
Sun Fire V880 System Architecture Sun Microsystems Product & Technology Group SE jjson@sun.com Sun Fire V880 Enterprise RAS Below PC Pricing NEW " Enterprise Class Application and Database Server " Scalable
More informationThese slides contain projections or other forward-looking statements within the meaning of Section 27A of the Securities Act of 1933, as amended, and
These slides contain projections or other forward-looking statements within the meaning of Section 27A of the Securities Act of 1933, as amended, and Section 21E of the Securities Exchange Act of 1934,
More informationThe AMD64 Technology for Server and Workstation. Dr. Ulrich Knechtel Enterprise Program Manager EMEA
The AMD64 Technology for Server and Workstation Dr. Ulrich Knechtel Enterprise Program Manager EMEA Agenda Direct Connect Architecture AMD Opteron TM Processor Roadmap Competition OEM support The AMD64
More informationPerformance Benefits of OpenVMS V8.4 Running on BL8x0c i2 Server Blades
Performance Benefits of OpenVMS V8.4 Running on BL8xc i2 Server Blades A detailed review of performance features and test results for OpenVMS V8.4. March 211 211, TechWise Research. All Rights Reserved
More informationDynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization
Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization Jeffrey Young, Sudhakar Yalamanchili School of Electrical and Computer Engineering, Georgia Institute of Technology Talk
More informationPerformance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware
Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware 2010 VMware Inc. All rights reserved About the Speaker Hemant Gaidhani Senior Technical
More informationNUMA replicated pagecache for Linux
NUMA replicated pagecache for Linux Nick Piggin SuSE Labs January 27, 2008 0-0 Talk outline I will cover the following areas: Give some NUMA background information Introduce some of Linux s NUMA optimisations
More informationAgenda. System Performance Scaling of IBM POWER6 TM Based Servers
System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies
More information2008 International ANSYS Conference
28 International ANSYS Conference Maximizing Performance for Large Scale Analysis on Multi-core Processor Systems Don Mize Technical Consultant Hewlett Packard 28 ANSYS, Inc. All rights reserved. 1 ANSYS,
More informationTile Processor (TILEPro64)
Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth
More informationComputer Systems Laboratory Sungkyunkwan University
I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage
More informationGemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc.
Gemini: A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor Sanjiv Kapil Gemini Architect Sun Microsystems, Inc. Design Goals Designed for compute-dense, transaction oriented systems (webservers,
More informationAutomatic NUMA Balancing. Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP
Automatic NUMA Balancing Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP Automatic NUMA Balancing Agenda What is NUMA, anyway? Automatic NUMA balancing internals
More informationChapter 2: Computer-System Structures. Hmm this looks like a Computer System?
Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding
More informationAn Oracle White Paper February Enabling End-to-End 10 Gigabit Ethernet in Oracle s Sun Netra ATCA Product Family
An Oracle White Paper February 2011 Enabling End-to-End 10 Gigabit Ethernet in Oracle s Sun Netra ATCA Product Family Executive Overview... 1! Introduction... 1! An Ecosystem Delivering Bandwidth and Flexibility...
More informationUniprocessor Computer Architecture Example: Cray T3E
Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure
More informationMaximizing Six-Core AMD Opteron Processor Performance with RHEL
Maximizing Six-Core AMD Opteron Processor Performance with RHEL Bhavna Sarathy Red Hat Technical Lead, AMD Sanjay Rao Senior Software Engineer, Red Hat Sept 4, 2009 1 Agenda Six-Core AMD Opteron processor
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers
William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved
More informationSunFire range of servers
TAKE IT TO THE NTH Frederic Vecoven Sun Microsystems SunFire range of servers System Components Fireplane Shared Interconnect Operating Environment Ultra SPARC & compilers Applications & Middleware Clustering
More informationBest Practices for Setting BIOS Parameters for Performance
White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page
More informationRecent Innovations in Data Storage Technologies Dr Roger MacNicol Software Architect
Recent Innovations in Data Storage Technologies Dr Roger MacNicol Software Architect Copyright 2017, Oracle and/or its affiliates. All rights reserved. Safe Harbor Statement The following is intended to
More informationSun N1: Storage Virtualization and Oracle
OracleWorld 2003 Session 36707 - Sun N1: Storage Virtualization and Oracle Glenn Colaco Performance Engineer Sun Microsystems Performance and Availability Engineering September 9, 2003 Background PAE works
More informationLink Aggregation: A Server Perspective
Link Aggregation: A Server Perspective Shimon Muller Sun Microsystems, Inc. May 2007 Supporters Howard Frazier, Broadcom Corp. Outline The Good, The Bad & The Ugly LAG and Network Virtualization Networking
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationAgenda. Sun s x Sun s x86 Strategy. 2. Sun s x86 Product Portfolio. 3. Virtualization < 1 >
Agenda Sun s x86 1. Sun s x86 Strategy 2. Sun s x86 Product Portfolio 3. Virtualization < 1 > 1. SUN s x86 Strategy Customer Challenges Power and cooling constraints are very real issues Energy costs are
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationWorkload Optimized Systems: The Wheel of Reincarnation. Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013
Workload Optimized Systems: The Wheel of Reincarnation Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013 Outline Definition Technology Minicomputers Prime Workstations Apollo Graphics
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationIntel Enterprise Processors Technology
Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationLecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform
More informationQLE10000 Series Adapter Provides Application Benefits Through I/O Caching
QLE10000 Series Adapter Provides Application Benefits Through I/O Caching QLogic Caching Technology Delivers Scalable Performance to Enterprise Applications Key Findings The QLogic 10000 Series 8Gb Fibre
More informationOracle Exadata: Strategy and Roadmap
Oracle Exadata: Strategy and Roadmap - New Technologies, Cloud, and On-Premises Juan Loaiza Senior Vice President, Database Systems Technologies, Oracle Safe Harbor Statement The following is intended
More informationPOWER7+ TM IBM IBM Corporation
POWER7+ TM 2012 Corporation Outline POWER Processor History Design Overview Performance Benchmarks Key Features Scale-up / Scale-out The new accelerators Advanced energy management Summary * Statements
More informationState of the Linux Kernel
State of the Linux Kernel Timothy D. Witham Chief Technology Officer Open Source Development Labs, Inc. 1 Agenda Process Performance/Scalability Responsiveness Usability Improvements Device support Multimedia
More informationLinux multi-core scalability
Linux multi-core scalability Oct 2009 Andi Kleen Intel Corporation andi@firstfloor.org Overview Scalability theory Linux history Some common scalability trouble-spots Application workarounds Motivation
More informationConvergence of Parallel Architecture
Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty
More informationReplacement Policy: Which block to replace from the set?
Replacement Policy: Which block to replace from the set? Direct mapped: no choice Associative: evict least recently used (LRU) difficult/costly with increasing associativity Alternative: random replacement
More informationThread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007
Thread-level Parallelism for the Masses Kunle Olukotun Computer Systems Lab Stanford University 2007 The World has Changed Process Technology Stops Improving! Moore s law but! Transistors don t get faster
More informationReference. T1 Architecture. T1 ( Niagara ) Case Study of a Multi-core, Multithreaded
Reference Case Study of a Multi-core, Multithreaded Processor The Sun T ( Niagara ) Computer Architecture, A Quantitative Approach, Fourth Edition, by John Hennessy and David Patterson, chapter. :/C:8
More informationIsoStack Highly Efficient Network Processing on Dedicated Cores
IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single
More informationGoro Watanabe. Bill King. OOW 2013 The Best Platform for Big Data and Oracle Database 12c. EVP Fujitsu R&D Center North America
OOW 2013 The Best Platform for Big Data and Oracle Database 12c Goro Watanabe EVP Fujitsu R&D Center North America Bill King EVP Platform Products Group Fujitsu America, Inc. Overview 1. Fujitsu: Quick
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too
More informationEE108B Lecture 17 I/O Buses and Interfacing to CPU. Christos Kozyrakis Stanford University
EE108B Lecture 17 I/O Buses and Interfacing to CPU Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements Remaining deliverables PA2.2. today HW4 on 3/13 Lab4 on 3/19
More informationVirtualization. Michael Tsai 2018/4/16
Virtualization Michael Tsai 2018/4/16 What is virtualization? Let s first look at a video from VMware http://www.vmware.com/tw/products/vsphere.html Problems? Low utilization Different needs DNS DHCP Web
More informationToday. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture
More informationNon-uniform memory access (NUMA)
Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationOPENSPARC T1 OVERVIEW
Chapter Four OPENSPARC T1 OVERVIEW Denis Sheahan Distinguished Engineer Niagara Architecture Group Sun Microsystems Creative Commons 3.0United United States License Creative CommonsAttribution-Share Attribution-Share
More informationKeyStone II. CorePac Overview
KeyStone II ARM Cortex A15 CorePac Overview ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB
More informationMM5 Modeling System Performance Research and Profiling. March 2009
MM5 Modeling System Performance Research and Profiling March 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center
More informationMultiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message pas
Multiple processor systems 1 Multiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message passing multiprocessor, access
More informationExperience the GRID Today with Oracle9i RAC
1 Experience the GRID Today with Oracle9i RAC Shig Hiura Pre-Sales Engineer Shig_Hiura@etagon.com 2 Agenda Introduction What is the Grid The Database Grid Oracle9i RAC Technology 10g vs. 9iR2 Comparison
More informationExample Networks on chip Freescale: MPC Telematics chip
Lecture 22: Interconnects & I/O Administration Take QUIZ 16 over P&H 6.6-10, 6.12-14 before 11:59pm Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Exams in ACES
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationOperating Systems, Fall Lecture 9, Tiina Niklander 1
Multiprocessor Systems Multiple processor systems Ch 8.1 8.3 1 Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message passing multiprocessor,
More informationNiagara-2: A Highly Threaded Server-on-a-Chip. Greg Grohoski Distinguished Engineer Sun Microsystems
Niagara-2: A Highly Threaded Server-on-a-Chip Greg Grohoski Distinguished Engineer Sun Microsystems August 22, 2006 Authors Jama Barreh Jeff Brooks Robert Golla Greg Grohoski Rick Hetherington Paul Jordan
More informationPerformance of Commercial Database Benchmarks on a CC-NUMA Computer System
STiNG Revisited: Performance of Commercial Database Benchmarks on a CC-NUMA Computer System Russell M. Clapp IBM NUMA-Q rclapp@us.ibm.com January 21st, 2001 1 Background STiNG was the code name of an engineering
More informationPart 1: Introduction to device drivers Part 2: Overview of research on device driver reliability Part 3: Device drivers research at ERTOS
Some statistics 70% of OS code is in device s 3,448,000 out of 4,997,000 loc in Linux 2.6.27 A typical Linux laptop runs ~240,000 lines of kernel code, including ~72,000 loc in 36 different device s s
More informationApplication Performance on Dual Processor Cluster Nodes
Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys
More informationPerformance analysis tools: Intel VTuneTM Amplifier and Advisor. Dr. Luigi Iapichino
Performance analysis tools: Intel VTuneTM Amplifier and Advisor Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimisation After having considered the MPI layer,
More informationWhat Transitioning from 32-bit to 64-bit x86 Computing Means Today
What Transitioning from 32-bit to 64-bit x86 Computing Means Today Chris Wanner Senior Architect, Industry Standard Servers Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information
More informationAdvanced Computer Networks. End Host Optimization
Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct
More informationChapter 6. Storage and Other I/O Topics
Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections
More informationRealize the True Potential of Server Virtualization with Oracle VM. Rossella Bellini Principal Sales Consultant HW Business Uniti
Realize the True Potential of Server Virtualization with Oracle Rossella Bellini Principal Sales Consultant HW Business Uniti Virtualization Budgets Are Increasing Virtualization Is a Key Driver for Lowering
More informationNested Virtualization and Server Consolidation
Nested Virtualization and Server Consolidation Vara Varavithya Department of Electrical Engineering, KMUTNB varavithya@gmail.com 1 Outline Virtualization & Background Nested Virtualization Hybrid-Nested
More informationMultiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.
Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than
More informationStorage. Hwansoo Han
Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationEnd-to-End Java Security Performance Enhancements for Oracle SPARC Servers Performance engineering for a revenue product
End-to-End Java Security Performance Enhancements for Oracle SPARC Servers Performance engineering for a revenue product Luyang Wang, Pallab Bhattacharya, Yao-Min Chen, Shrinivas Joshi and James Cheng
More informationConfiguring SR-IOV. Table of contents. with HP Virtual Connect and Microsoft Hyper-V. Technical white paper
Technical white paper Configuring SR-IOV with HP Virtual Connect and Microsoft Hyper-V Table of contents Abstract... 2 Overview... 2 SR-IOV... 2 Advantages and usage... 2 With Flex-10... 3 Setup... 4 Supported
More informationWhite paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation
White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview
More informationOctopus: A Multi-core implementation
Octopus: A Multi-core implementation Kalpesh Sheth HPEC 2007, MIT, Lincoln Lab Export of this products is subject to U.S. export controls. Licenses may be required. This material provides up-to-date general
More informationHow much energy can you save with a multicore computer for web applications?
How much energy can you save with a multicore computer for web applications? Peter Strazdins Computer Systems Group, Department of Computer Science, The Australian National University seminar at Green
More informationIntegrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective
Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Nathan Woods XtremeData FPGA 2007 Outline Background Problem Statement Possible Solutions Description
More informationProtoFlex: FPGA-Accelerated Hybrid Simulator
ProtoFlex: FPGA-Accelerated Hybrid Simulator Eric S. Chung, Eriko Nurvitadhi James C. Hoe, Babak Falsafi, Ken Mai Computer Architecture Lab at Multiprocessor Simulation Simulating one processor in software
More informationHW Trends and Architectures
Pavel Tvrdík, Jiří Kašpar (ČVUT FIT) HW Trends and Architectures MI-POA, 2011, Lecture 1 1/29 HW Trends and Architectures prof. Ing. Pavel Tvrdík CSc. Ing. Jiří Kašpar Department of Computer Systems Faculty
More informationMultiprocessor Systems. COMP s1
Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve
More informationCOMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy
COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationMany-Core Computing Era and New Challenges. Nikos Hardavellas, EECS
Many-Core Computing Era and New Challenges Nikos Hardavellas, EECS Moore s Law Is Alive And Well 90nm 90nm transistor (Intel, 2005) Swine Flu A/H1N1 (CDC) 65nm 2007 45nm 2010 32nm 2013 22nm 2016 16nm 2019
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationParallels Virtuozzo Containers
Parallels Virtuozzo Containers White Paper Parallels Virtuozzo Containers for Windows Capacity and Scaling www.parallels.com Version 1.0 Table of Contents Introduction... 3 Resources and bottlenecks...
More informationOpenSolaris and the Direction of Future Operating Systems
OpenSolaris and the Direction of Future Operating Systems James Hughes Sun Fellow Solaris Chief Technologist LISA'08 November 2008 San Diego, CA Agenda Operating System Trends Computer / OS architecture
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationSun Lustre Storage System Simplifying and Accelerating Lustre Deployments
Sun Lustre Storage System Simplifying and Accelerating Lustre Deployments Torben Kling-Petersen, PhD Presenter s Name Principle Field Title andengineer Division HPC &Cloud LoB SunComputing Microsystems
More information