Hardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.

Size: px
Start display at page:

Download "Hardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc."

Transcription

1 Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc.

2 Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT Aim of Victoria Falls Hardware scaling Data Scaling the OS Virtualization and Consolidation Scaling Applications Conclusions Page 2

3 Memory Bottleneck Relative Performance CPU Frequency DRAM Speeds CPU -- 2x Every 2 Years DRAM -- 2x Every 6 Years Gap Source: Sun World Wide Analyst Conference Feb. 25, 2003

4 CMT Implementation Four threads share a single pipeline Every cpu cycle an instruction from a different thread is executed Thread 4 Thread 3 Thread 2 Thread 1 C C C M C M M C M C C M C M M C M C C M C M M M Niagara Processor Shared Pipeline Utilization: Up to 85% Memory Latency Compute Time Page 4

5 Lessons from CMT Multiple cores are here to stay. Already quad core, 8 core on most roadmaps Big impact on commercial applications, Big problem for desktop and laptop software The pursuit of frequency is over. Apart from one or two vendors most have dropped their 4GHz and 5GHz plans Very high frequency usually means higher power Higher frequency usually adds complexity and lengthens time to market

6 Lessons learned from CMT Power and cooling have become extremely important to most customers Ecoresponsibility and carbon footprint are the new standards for datacenters Many datacenters are at capacity Acceleration in the development of lower power systems Also accelerated Virtualization technology. Customer trading performance versus power

7 Lessons learned from CMT Many throughput workloads benefit more from higher thread count than higher frequency CMT has proven this time and again with public benchmarks Some code is single threaded and we need to work to make them parallel Not as big an issue in the HPC space but even here can have periods such as results aggregation and report generation that are serial

8 Big Question How do you make CMT Scale to 32 cores and 256 threads Page 8

9 Aim of Victoria Falls Create an SMP version of CMT to extend the highly threaded Niagara design Use T2 as the basis for these systems Minimal modifications to T2 for shorter time to market Data Create two-way and four-way systems without the need for a traditional SMP backplane Avoid any hardware bottlenecks to scaling High throughput low latency interconnect Scale memory bandwidth with processors Scale I/O bandwidth with processors Hardware features to enable software scaling Page 9

10 Victoria Falls Processor VF is a derivative. Re-use UltraSPARC T2 cores, crossbar, L2$, PCI-Express I/F Remove 10GE network interface Replace two memory channels with four coherence channels Increase memory link speed to 4.8Gbps (was 4.0Gbps) Added relaxed DMA ordering to PCI- Ex interface All I/O on VF via the x8 PCI-E link per chip, scale I/O with processors Similar packaging as US T2 4 Coherence Channels 6.4 GB/s per channel each direction, 51GB/s total Dual Channel Coherence Unit FBDIMM Memory Controller Coherence Unit PCI-Express 2.5Ghz 2 GB/s each direction Dual Channel FBDIMM Controller Coherence Coherence Unit Memory Unit Niagara2 Cores, Crossbar, L2$ (8 cores, 64 threads, 4MB L2$) NCU, DMU 25 GB/s read 13 GB/s write NCX Page 10

11 Hardware Scaling - Interconnect Interconnect is key to scaling to 32 cores and 256 threads Decision not to use a traditional bus given the memory bandwidth requirements Instead we used half the FBDIMM channels for memory Data and the other half to create 4 bidirectional links for interconnect Physical link of interconnect same as FBDIMM Reuse the FBDIMM pins for the interconnect Result is a high speed SERDES link Provides high bandwidth and a low latency interconnect Page 11

12 VF 2-Socket System Dual Channel FBDIMM Dual Channel FBDIMM Dual Channel FBDIMM Dual Channel FBDIMM Memory Controller Memory Controller Memory Controller Memory Controller Coherence Unit Coherence Unit Coherence Unit Coherence Unit Coherence Unit Coherence Unit Coherence Unit Coherence Unit Niagara2 Cores, Crossbar, L2$ (8 cores, 64 threads, 4MB L2$) Niagara2 Cores, Crossbar, L2$ (8 cores, 64 threads, 4MB L2$) PCI-Express NCU, DMU NCX PCI-Express NCU, DMU NCX System IO (Network, Disk, etc.)

13 VF 4-Socket System PCI-Express VF FBDIMM Dual Channel Dual Channel FBDIMM VF PCI-Express 8 Cores 8 Cores 64 Threads 4MB L2$ FBDIMM Dual Channel Coherence Hubs (4) Dual Channel FBDIMM 64 Threads 4MB L2$ 4 ports/hub PCI-Express VF VF FBDIMM Dual Channel 6.4GB/s per port in each dir Dual Channel FBDIMM VF PCI-Express 8 Cores 8 Cores 8 Threads/Core 64 4MB 4MB L2$ L2$ FBDIMM Dual Channel Dual Channel FBDIMM 8 Cores 64 Threads 4MB L2$

14 Hardware Scaling - Memory Two elements are key to scaling on highly threaded servers Memory bandwidth. Hundreds of threads require a very wide pipe to memory Local vs. Remote memory latency. Systems that are highly NUMA Data can show inverse scaling if memory is placed badly Each VF processor has its own local memory connected delivering high memory bandwidth > 21GB/sec Read and 10GB/sec Write > Memory bandwidth scales with the number of processors Remote memory latency has a 76ns penalty for access, about 1.5x latency to local memory This makes VF systems NUMA although not highly so. Page 14

15 Software Scaling - scaling the OS There are a set of common issues when scaling the OS on a Multi-core highly threaded system Scaling single threaded kernel services Scaling contended mutexes Optimising memory placement Data Optimal scheduling across cores Scaling I/O Local vs. remote Delivering interrupts Virtualization Page 15

16 Single threaded kernel services Prime example of a single threaded kernel service is when Solaris performs accounting and book keeping activities every clock tick. As the number of CPUs increases, the tick accounting loop gets larger. On a 4 way VF system there are 256 threads to check Data Tick accounting is a single threaded activity and on a busy system with many CPUs, the loop can often take more than a tick to process if the locks it needs to acquire are busy. This causes clock drift and other unpleasantness. Solution is to involve multiple cpus in the Tick scheduling. Cpus are collected in groups and one is chosen to schedule all their ticks Algorithm is used to spread the tick accounting evenly across active cpus. Page 16

17 Scaling contended mutexes mutexes in Solaris are optimised for the non contended case. Contended mutexes are considered rare With up to 256 active threads in a single OS instance, calls such as atomic_add_32() on contended mutexes can become much hotter. Data Solaris solution is an Exponential backoff algorithm. This has the advantage of little overhead in the noncontended case and performance gains in the contended case Page 17

18 NUMA aware OS As we have seen VF systems are the first NUMA CMT servers A NUMA aware OS ultimately helps in scaling an application as it reduces dramatically the interconnect traffic between chips in the system Data Solaris has been NUMA aware since Solaris 9 Processes are assigned to a local lgroup where hot areas of their address space such as heap and stack are allocated and where they will be scheduled The effect is that there is an increase in accesses to lower latency local memory Page 18

19 The Hypervisor: Virtualization for CMT Platforms sun4v architecture Operating System Solaris X update (genunix) sun4u code Solaris X (genunix) Solaris X (sun4v) US-Z CPU code sun4v interface CPU Z SPARC hypervisor SPARC CPU Platform Page 19

20 The Sun4v/HV/LDOMs Model Replace HW domains with Logical Domains > Highly flexible Each Domain runs an independent OS > Capability for specialized domains Logical Domain 1 Service Domain Logical Domain 2 Solaris 10 App App Container Logical Domain 3 Open Solaris App App Container 1 App App Container 2 Hypervisor Hardware Shared CPU, Memory, IO I/O CPU Mem CPU Mem CPU Mem CPU Page 20

21 Virtualized I/O Logical Domain A Service Domain App App App App Device Driver /pci@b/qlc@6 Privileged Virtual Device Driver Virtual Device Service Nexus Driver /pci@b Hyper Privileged Hypervisor Domain Channel Virtual Nexus I/F Hardware I/O MMU PCI Root I/O Bridge PCI B Page 21

22 The future - Hybrid I/O Logical Domain A Service Domain App App App App Common Device Driver Privileged Device Driver Common Device Service Nexus Driver /pci@b Hyper Privileged Hypervisor Domain Channel Virtual Nexus I/F Hardware I/O MMU PCI Root I/O Bridge PCI B Page 22

23 Scaling Applications on many cores Scaling and throughput are the most important goals for performance on highly threaded systems High Utilisation of the pipelines is key to performance. Need many software threads or processes to utilise the hardware Datathreads. Threads have to be actually working Multiple instances of an application may be necessary to achieve scaling A single thread can become the bottleneck for the application Spinning in a tight loop waiting for a hot lock is not good for scaling on CMT Page 23

24 Utilization and Corestat In order to determine the utilization we use low level hardware counters to count the number of instructions per second per pipeline both integer and floating point We have written a tool called corestat which collects data from the low level performance counters and aggregates it to present utilization percentages Available from Corestat reports : Individual integer pipeline utilization (Default mode) Floating Point Unit utilization -g flag) Page 24

25 Locking issues The most common reason an application fails to scale on CMT is hot locks We have found this especially true when migrating from 2-4 way systems where access to hot locks and structures is effectively serialized If a system has 16 cores and 128 threads, then there can be up to 128 instruction streams executing in parallel. Hot locks which will tend to reside in the L2 cache and can become extremely contended. When developing locking code we need to use a low-impact, long-latency opcode to add delay in the busy-wait loop. The low-impact opcode frees up cycles so that other threads sharing the core get more useful work done Page 25

26 Proven Applications Areas for CMT Web-servers: Apache, SunOne, Lighttpd Web Proxy: Sun Web Proxy J2SE Appservers: BEA WLS, Websphere, SunOne, Oracle Database Servers: Oracle, Sybase, mysql, Postgres, DB2 Mail Servers: Sendmail, Domino, Brightmail, Openwave ERP: Siebel, Peoplesoft Oracle E-Business Security Page 26

27 Proven Applications Areas for CMT JES Stack: Directory, Portal, Access Manager NFS Servers DNS: BIND E-Learning: Blackboard Capacity Planing: BMC, HP, Tivoli Net Backup: Search: Lucene Development: Clearcase, compile servers Streaming Video Page 27

28 VF Performance Scaling Internal OLTP HPC Matrix Java Business CPU Intensive Single-chip 1.4Ghz Victoria Falls, 8 Cores, 64 Threads, Solaris Dual-chip 1.4Ghz Victoria Falls, 16 Cores, 128 Threads, Solaris Page 28

29 VF Availability Victoria Falls 2-socket Niagara systems Available today Victoria Falls 2-socket Niagara blades 2HCY08 Victoria Falls 4-socket Niagara systems 1HCY08 Data Page 29

30 Conclusions Multiple cores are here to stay and the pursuit of frequency is over Power and cooling are the new driving factors in the datacenter. This requires much higher utilization through scaling, consolidation and virtualization A high bandwidth, Data low latency interconnect is essential to achieve good scalability on multi-core systems Scalable memory bandwidth and physical I/O is also required An OS running on a highly threaded processor must also be scalable and in many cases NUMA aware. It must also be virtualizable Servers with many cores and hardware threads present a different set of scaling issues to applications Page 30

Logical Domains (LDoms)

Logical Domains (LDoms) Logical Domains (LDoms) Liam Merwick LDoms Developer Sun Microsystems SPARC Platform S/W Group Contents Background Features Architecture Components Future Roadmap Configuration Examples Q & A Page: 2 Background

More information

Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES

Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Tom Atwood Business Development Manager Sun Microsystems, Inc. Takeaways Understand the technical differences between

More information

Overview of the SPARC Enterprise Servers

Overview of the SPARC Enterprise Servers Overview of the SPARC Enterprise Servers SPARC Enterprise Technologies for the Datacenter Ideal for Enterprise Application Deployments System Overview Virtualization technologies > Maximize system utilization

More information

Solaris Engineered Systems

Solaris Engineered Systems Solaris Engineered Systems SPARC SuperCluster Introduction Andy Harrison andy.harrison@oracle.com Engineered Systems, Revenue Product Engineering The following is intended to outline

More information

Virtual Leverage: Server Consolidation in Open Source Environments. Margaret Lewis Commercial Software Strategist AMD

Virtual Leverage: Server Consolidation in Open Source Environments. Margaret Lewis Commercial Software Strategist AMD Virtual Leverage: Server Consolidation in Open Source Environments Margaret Lewis Commercial Software Strategist AMD What Is Virtualization? Abstraction of Hardware Components Virtual Memory Virtual Volume

More information

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD Leveraging OpenSPARC ESA Round Table 2006 on Next Generation Microprocessors for Space Applications G.Furano, L.Messina TEC- OpenSPARC T1 The T1 is a new-from-the-ground-up SPARC microprocessor implementation

More information

Oracle Performance on M5000 with F20 Flash Cache. Benchmark Report September 2011

Oracle Performance on M5000 with F20 Flash Cache. Benchmark Report September 2011 Oracle Performance on M5000 with F20 Flash Cache Benchmark Report September 2011 Contents 1 About Benchware 2 Flash Cache Technology 3 Storage Performance Tests 4 Conclusion copyright 2011 by benchware.ch

More information

PERFORMANCE CONSIDERATIONS FOR DEVELOPERS UTILIZING SUN SPARC ENTERPRISE M-SERIES SERVERS

PERFORMANCE CONSIDERATIONS FOR DEVELOPERS UTILIZING SUN SPARC ENTERPRISE M-SERIES SERVERS July 2008 PERFORMANCE CONSIDERATIONS FOR DEVELOPERS UTILIZING SUN SPARC ENTERPRISE M-SERIES SERVERS Important note: this paper was originally published before the acquisition of Sun Microsystems by Oracle

More information

Power your planet. Optimizing the Enterprise Data Center POWER7 Powers a Smarter Infrastructure

Power your planet. Optimizing the Enterprise Data Center POWER7 Powers a Smarter Infrastructure Power your planet. Optimizing the Enterprise Data Center POWER7 Powers a Smarter Infrastructure Enoch Lau Field Technical Sales Specialist, Power Systems Systems & Technology Group Power your planet. Smarter

More information

The Missing Piece of Virtualization. I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers

The Missing Piece of Virtualization. I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers The Missing Piece of Virtualization I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers Agenda 10 GbE Adapters Built for Virtualization I/O Throughput: Virtual & Non-Virtual Servers Case

More information

NUMA-aware OpenMP Programming

NUMA-aware OpenMP Programming NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC

More information

Sun Fire V880 System Architecture. Sun Microsystems Product & Technology Group SE

Sun Fire V880 System Architecture. Sun Microsystems Product & Technology Group SE Sun Fire V880 System Architecture Sun Microsystems Product & Technology Group SE jjson@sun.com Sun Fire V880 Enterprise RAS Below PC Pricing NEW " Enterprise Class Application and Database Server " Scalable

More information

These slides contain projections or other forward-looking statements within the meaning of Section 27A of the Securities Act of 1933, as amended, and

These slides contain projections or other forward-looking statements within the meaning of Section 27A of the Securities Act of 1933, as amended, and These slides contain projections or other forward-looking statements within the meaning of Section 27A of the Securities Act of 1933, as amended, and Section 21E of the Securities Exchange Act of 1934,

More information

The AMD64 Technology for Server and Workstation. Dr. Ulrich Knechtel Enterprise Program Manager EMEA

The AMD64 Technology for Server and Workstation. Dr. Ulrich Knechtel Enterprise Program Manager EMEA The AMD64 Technology for Server and Workstation Dr. Ulrich Knechtel Enterprise Program Manager EMEA Agenda Direct Connect Architecture AMD Opteron TM Processor Roadmap Competition OEM support The AMD64

More information

Performance Benefits of OpenVMS V8.4 Running on BL8x0c i2 Server Blades

Performance Benefits of OpenVMS V8.4 Running on BL8x0c i2 Server Blades Performance Benefits of OpenVMS V8.4 Running on BL8xc i2 Server Blades A detailed review of performance features and test results for OpenVMS V8.4. March 211 211, TechWise Research. All Rights Reserved

More information

Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization

Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization Jeffrey Young, Sudhakar Yalamanchili School of Electrical and Computer Engineering, Georgia Institute of Technology Talk

More information

Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware

Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware 2010 VMware Inc. All rights reserved About the Speaker Hemant Gaidhani Senior Technical

More information

NUMA replicated pagecache for Linux

NUMA replicated pagecache for Linux NUMA replicated pagecache for Linux Nick Piggin SuSE Labs January 27, 2008 0-0 Talk outline I will cover the following areas: Give some NUMA background information Introduce some of Linux s NUMA optimisations

More information

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies

More information

2008 International ANSYS Conference

2008 International ANSYS Conference 28 International ANSYS Conference Maximizing Performance for Large Scale Analysis on Multi-core Processor Systems Don Mize Technical Consultant Hewlett Packard 28 ANSYS, Inc. All rights reserved. 1 ANSYS,

More information

Tile Processor (TILEPro64)

Tile Processor (TILEPro64) Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage

More information

Gemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc.

Gemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc. Gemini: A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor Sanjiv Kapil Gemini Architect Sun Microsystems, Inc. Design Goals Designed for compute-dense, transaction oriented systems (webservers,

More information

Automatic NUMA Balancing. Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP

Automatic NUMA Balancing. Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP Automatic NUMA Balancing Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP Automatic NUMA Balancing Agenda What is NUMA, anyway? Automatic NUMA balancing internals

More information

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System?

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System? Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding

More information

An Oracle White Paper February Enabling End-to-End 10 Gigabit Ethernet in Oracle s Sun Netra ATCA Product Family

An Oracle White Paper February Enabling End-to-End 10 Gigabit Ethernet in Oracle s Sun Netra ATCA Product Family An Oracle White Paper February 2011 Enabling End-to-End 10 Gigabit Ethernet in Oracle s Sun Netra ATCA Product Family Executive Overview... 1! Introduction... 1! An Ecosystem Delivering Bandwidth and Flexibility...

More information

Uniprocessor Computer Architecture Example: Cray T3E

Uniprocessor Computer Architecture Example: Cray T3E Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure

More information

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

Maximizing Six-Core AMD Opteron Processor Performance with RHEL Maximizing Six-Core AMD Opteron Processor Performance with RHEL Bhavna Sarathy Red Hat Technical Lead, AMD Sanjay Rao Senior Software Engineer, Red Hat Sept 4, 2009 1 Agenda Six-Core AMD Opteron processor

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved

More information

SunFire range of servers

SunFire range of servers TAKE IT TO THE NTH Frederic Vecoven Sun Microsystems SunFire range of servers System Components Fireplane Shared Interconnect Operating Environment Ultra SPARC & compilers Applications & Middleware Clustering

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

Recent Innovations in Data Storage Technologies Dr Roger MacNicol Software Architect

Recent Innovations in Data Storage Technologies Dr Roger MacNicol Software Architect Recent Innovations in Data Storage Technologies Dr Roger MacNicol Software Architect Copyright 2017, Oracle and/or its affiliates. All rights reserved. Safe Harbor Statement The following is intended to

More information

Sun N1: Storage Virtualization and Oracle

Sun N1: Storage Virtualization and Oracle OracleWorld 2003 Session 36707 - Sun N1: Storage Virtualization and Oracle Glenn Colaco Performance Engineer Sun Microsystems Performance and Availability Engineering September 9, 2003 Background PAE works

More information

Link Aggregation: A Server Perspective

Link Aggregation: A Server Perspective Link Aggregation: A Server Perspective Shimon Muller Sun Microsystems, Inc. May 2007 Supporters Howard Frazier, Broadcom Corp. Outline The Good, The Bad & The Ugly LAG and Network Virtualization Networking

More information

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,

More information

Agenda. Sun s x Sun s x86 Strategy. 2. Sun s x86 Product Portfolio. 3. Virtualization < 1 >

Agenda. Sun s x Sun s x86 Strategy. 2. Sun s x86 Product Portfolio. 3. Virtualization < 1 > Agenda Sun s x86 1. Sun s x86 Strategy 2. Sun s x86 Product Portfolio 3. Virtualization < 1 > 1. SUN s x86 Strategy Customer Challenges Power and cooling constraints are very real issues Energy costs are

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Workload Optimized Systems: The Wheel of Reincarnation. Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013

Workload Optimized Systems: The Wheel of Reincarnation. Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013 Workload Optimized Systems: The Wheel of Reincarnation Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013 Outline Definition Technology Minicomputers Prime Workstations Apollo Graphics

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Intel Enterprise Processors Technology

Intel Enterprise Processors Technology Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform

More information

QLE10000 Series Adapter Provides Application Benefits Through I/O Caching

QLE10000 Series Adapter Provides Application Benefits Through I/O Caching QLE10000 Series Adapter Provides Application Benefits Through I/O Caching QLogic Caching Technology Delivers Scalable Performance to Enterprise Applications Key Findings The QLogic 10000 Series 8Gb Fibre

More information

Oracle Exadata: Strategy and Roadmap

Oracle Exadata: Strategy and Roadmap Oracle Exadata: Strategy and Roadmap - New Technologies, Cloud, and On-Premises Juan Loaiza Senior Vice President, Database Systems Technologies, Oracle Safe Harbor Statement The following is intended

More information

POWER7+ TM IBM IBM Corporation

POWER7+ TM IBM IBM Corporation POWER7+ TM 2012 Corporation Outline POWER Processor History Design Overview Performance Benchmarks Key Features Scale-up / Scale-out The new accelerators Advanced energy management Summary * Statements

More information

State of the Linux Kernel

State of the Linux Kernel State of the Linux Kernel Timothy D. Witham Chief Technology Officer Open Source Development Labs, Inc. 1 Agenda Process Performance/Scalability Responsiveness Usability Improvements Device support Multimedia

More information

Linux multi-core scalability

Linux multi-core scalability Linux multi-core scalability Oct 2009 Andi Kleen Intel Corporation andi@firstfloor.org Overview Scalability theory Linux history Some common scalability trouble-spots Application workarounds Motivation

More information

Convergence of Parallel Architecture

Convergence of Parallel Architecture Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty

More information

Replacement Policy: Which block to replace from the set?

Replacement Policy: Which block to replace from the set? Replacement Policy: Which block to replace from the set? Direct mapped: no choice Associative: evict least recently used (LRU) difficult/costly with increasing associativity Alternative: random replacement

More information

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007 Thread-level Parallelism for the Masses Kunle Olukotun Computer Systems Lab Stanford University 2007 The World has Changed Process Technology Stops Improving! Moore s law but! Transistors don t get faster

More information

Reference. T1 Architecture. T1 ( Niagara ) Case Study of a Multi-core, Multithreaded

Reference. T1 Architecture. T1 ( Niagara ) Case Study of a Multi-core, Multithreaded Reference Case Study of a Multi-core, Multithreaded Processor The Sun T ( Niagara ) Computer Architecture, A Quantitative Approach, Fourth Edition, by John Hennessy and David Patterson, chapter. :/C:8

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

Goro Watanabe. Bill King. OOW 2013 The Best Platform for Big Data and Oracle Database 12c. EVP Fujitsu R&D Center North America

Goro Watanabe. Bill King. OOW 2013 The Best Platform for Big Data and Oracle Database 12c. EVP Fujitsu R&D Center North America OOW 2013 The Best Platform for Big Data and Oracle Database 12c Goro Watanabe EVP Fujitsu R&D Center North America Bill King EVP Platform Products Group Fujitsu America, Inc. Overview 1. Fujitsu: Quick

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

EE108B Lecture 17 I/O Buses and Interfacing to CPU. Christos Kozyrakis Stanford University

EE108B Lecture 17 I/O Buses and Interfacing to CPU. Christos Kozyrakis Stanford University EE108B Lecture 17 I/O Buses and Interfacing to CPU Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements Remaining deliverables PA2.2. today HW4 on 3/13 Lab4 on 3/19

More information

Virtualization. Michael Tsai 2018/4/16

Virtualization. Michael Tsai 2018/4/16 Virtualization Michael Tsai 2018/4/16 What is virtualization? Let s first look at a video from VMware http://www.vmware.com/tw/products/vsphere.html Problems? Low utilization Different needs DNS DHCP Web

More information

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture

More information

Non-uniform memory access (NUMA)

Non-uniform memory access (NUMA) Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

OPENSPARC T1 OVERVIEW

OPENSPARC T1 OVERVIEW Chapter Four OPENSPARC T1 OVERVIEW Denis Sheahan Distinguished Engineer Niagara Architecture Group Sun Microsystems Creative Commons 3.0United United States License Creative CommonsAttribution-Share Attribution-Share

More information

KeyStone II. CorePac Overview

KeyStone II. CorePac Overview KeyStone II ARM Cortex A15 CorePac Overview ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB

More information

MM5 Modeling System Performance Research and Profiling. March 2009

MM5 Modeling System Performance Research and Profiling. March 2009 MM5 Modeling System Performance Research and Profiling March 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center

More information

Multiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message pas

Multiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message pas Multiple processor systems 1 Multiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message passing multiprocessor, access

More information

Experience the GRID Today with Oracle9i RAC

Experience the GRID Today with Oracle9i RAC 1 Experience the GRID Today with Oracle9i RAC Shig Hiura Pre-Sales Engineer Shig_Hiura@etagon.com 2 Agenda Introduction What is the Grid The Database Grid Oracle9i RAC Technology 10g vs. 9iR2 Comparison

More information

Example Networks on chip Freescale: MPC Telematics chip

Example Networks on chip Freescale: MPC Telematics chip Lecture 22: Interconnects & I/O Administration Take QUIZ 16 over P&H 6.6-10, 6.12-14 before 11:59pm Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Exams in ACES

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

Operating Systems, Fall Lecture 9, Tiina Niklander 1

Operating Systems, Fall Lecture 9, Tiina Niklander 1 Multiprocessor Systems Multiple processor systems Ch 8.1 8.3 1 Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message passing multiprocessor,

More information

Niagara-2: A Highly Threaded Server-on-a-Chip. Greg Grohoski Distinguished Engineer Sun Microsystems

Niagara-2: A Highly Threaded Server-on-a-Chip. Greg Grohoski Distinguished Engineer Sun Microsystems Niagara-2: A Highly Threaded Server-on-a-Chip Greg Grohoski Distinguished Engineer Sun Microsystems August 22, 2006 Authors Jama Barreh Jeff Brooks Robert Golla Greg Grohoski Rick Hetherington Paul Jordan

More information

Performance of Commercial Database Benchmarks on a CC-NUMA Computer System

Performance of Commercial Database Benchmarks on a CC-NUMA Computer System STiNG Revisited: Performance of Commercial Database Benchmarks on a CC-NUMA Computer System Russell M. Clapp IBM NUMA-Q rclapp@us.ibm.com January 21st, 2001 1 Background STiNG was the code name of an engineering

More information

Part 1: Introduction to device drivers Part 2: Overview of research on device driver reliability Part 3: Device drivers research at ERTOS

Part 1: Introduction to device drivers Part 2: Overview of research on device driver reliability Part 3: Device drivers research at ERTOS Some statistics 70% of OS code is in device s 3,448,000 out of 4,997,000 loc in Linux 2.6.27 A typical Linux laptop runs ~240,000 lines of kernel code, including ~72,000 loc in 36 different device s s

More information

Application Performance on Dual Processor Cluster Nodes

Application Performance on Dual Processor Cluster Nodes Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys

More information

Performance analysis tools: Intel VTuneTM Amplifier and Advisor. Dr. Luigi Iapichino

Performance analysis tools: Intel VTuneTM Amplifier and Advisor. Dr. Luigi Iapichino Performance analysis tools: Intel VTuneTM Amplifier and Advisor Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimisation After having considered the MPI layer,

More information

What Transitioning from 32-bit to 64-bit x86 Computing Means Today

What Transitioning from 32-bit to 64-bit x86 Computing Means Today What Transitioning from 32-bit to 64-bit x86 Computing Means Today Chris Wanner Senior Architect, Industry Standard Servers Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

Chapter 6. Storage and Other I/O Topics

Chapter 6. Storage and Other I/O Topics Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections

More information

Realize the True Potential of Server Virtualization with Oracle VM. Rossella Bellini Principal Sales Consultant HW Business Uniti

Realize the True Potential of Server Virtualization with Oracle VM. Rossella Bellini Principal Sales Consultant HW Business Uniti Realize the True Potential of Server Virtualization with Oracle Rossella Bellini Principal Sales Consultant HW Business Uniti Virtualization Budgets Are Increasing Virtualization Is a Key Driver for Lowering

More information

Nested Virtualization and Server Consolidation

Nested Virtualization and Server Consolidation Nested Virtualization and Server Consolidation Vara Varavithya Department of Electrical Engineering, KMUTNB varavithya@gmail.com 1 Outline Virtualization & Background Nested Virtualization Hybrid-Nested

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Storage. Hwansoo Han

Storage. Hwansoo Han Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

End-to-End Java Security Performance Enhancements for Oracle SPARC Servers Performance engineering for a revenue product

End-to-End Java Security Performance Enhancements for Oracle SPARC Servers Performance engineering for a revenue product End-to-End Java Security Performance Enhancements for Oracle SPARC Servers Performance engineering for a revenue product Luyang Wang, Pallab Bhattacharya, Yao-Min Chen, Shrinivas Joshi and James Cheng

More information

Configuring SR-IOV. Table of contents. with HP Virtual Connect and Microsoft Hyper-V. Technical white paper

Configuring SR-IOV. Table of contents. with HP Virtual Connect and Microsoft Hyper-V. Technical white paper Technical white paper Configuring SR-IOV with HP Virtual Connect and Microsoft Hyper-V Table of contents Abstract... 2 Overview... 2 SR-IOV... 2 Advantages and usage... 2 With Flex-10... 3 Setup... 4 Supported

More information

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview

More information

Octopus: A Multi-core implementation

Octopus: A Multi-core implementation Octopus: A Multi-core implementation Kalpesh Sheth HPEC 2007, MIT, Lincoln Lab Export of this products is subject to U.S. export controls. Licenses may be required. This material provides up-to-date general

More information

How much energy can you save with a multicore computer for web applications?

How much energy can you save with a multicore computer for web applications? How much energy can you save with a multicore computer for web applications? Peter Strazdins Computer Systems Group, Department of Computer Science, The Australian National University seminar at Green

More information

Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective

Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Nathan Woods XtremeData FPGA 2007 Outline Background Problem Statement Possible Solutions Description

More information

ProtoFlex: FPGA-Accelerated Hybrid Simulator

ProtoFlex: FPGA-Accelerated Hybrid Simulator ProtoFlex: FPGA-Accelerated Hybrid Simulator Eric S. Chung, Eriko Nurvitadhi James C. Hoe, Babak Falsafi, Ken Mai Computer Architecture Lab at Multiprocessor Simulation Simulating one processor in software

More information

HW Trends and Architectures

HW Trends and Architectures Pavel Tvrdík, Jiří Kašpar (ČVUT FIT) HW Trends and Architectures MI-POA, 2011, Lecture 1 1/29 HW Trends and Architectures prof. Ing. Pavel Tvrdík CSc. Ing. Jiří Kašpar Department of Computer Systems Faculty

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS Many-Core Computing Era and New Challenges Nikos Hardavellas, EECS Moore s Law Is Alive And Well 90nm 90nm transistor (Intel, 2005) Swine Flu A/H1N1 (CDC) 65nm 2007 45nm 2010 32nm 2013 22nm 2016 16nm 2019

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

Parallels Virtuozzo Containers

Parallels Virtuozzo Containers Parallels Virtuozzo Containers White Paper Parallels Virtuozzo Containers for Windows Capacity and Scaling www.parallels.com Version 1.0 Table of Contents Introduction... 3 Resources and bottlenecks...

More information

OpenSolaris and the Direction of Future Operating Systems

OpenSolaris and the Direction of Future Operating Systems OpenSolaris and the Direction of Future Operating Systems James Hughes Sun Fellow Solaris Chief Technologist LISA'08 November 2008 San Diego, CA Agenda Operating System Trends Computer / OS architecture

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Sun Lustre Storage System Simplifying and Accelerating Lustre Deployments

Sun Lustre Storage System Simplifying and Accelerating Lustre Deployments Sun Lustre Storage System Simplifying and Accelerating Lustre Deployments Torben Kling-Petersen, PhD Presenter s Name Principle Field Title andengineer Division HPC &Cloud LoB SunComputing Microsystems

More information