
Normal and Exotic use cases of NUMA features in the Linux Kernel. Christopher Lameter, Ph.D., cl@linux.com. Collaboration Summit, Napa Valley, 2014

Non-Uniform Memory Access in the Linux Kernel. Memory is segmented into nodes so that applications and the operating system can place memory on a specific node. Regular uses: nodes have local processors and the system optimizes memory placement to be local to the processing. This may be automated (see the next talk on AutoNUMA) or configured manually through system tools and/or system calls made by the application (a sketch follows below). Exotic uses: nodes may have no local processors, or may have memory characteristics other than just the latency differences of regular memory: NVRAM, high-speed caches, multiple nodes per real memory segment, a central large node.
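
The manual route mentioned above usually goes through libnuma, the library behind the numactl tool. The following is a minimal sketch (not from the talk) of placing an allocation on a particular node; the 64 MiB size and the target node number 1 are illustrative assumptions.

    /* Minimal sketch of manual NUMA placement with libnuma (link with -lnuma).
     * The node number and size are examples; real code should query the topology. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        size_t len = 64UL * 1024 * 1024;          /* 64 MiB */
        int node = 1;                             /* example target node */

        void *buf = numa_alloc_onnode(len, node); /* prefer pages from node 1 */
        if (!buf)
            return 1;

        memset(buf, 0, len);    /* first touch actually allocates the pages */
        numa_free(buf, len);
        return 0;
    }

The same placement can be requested for an unmodified program from the command line, e.g. numactl --membind=1 ./app.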

Characteristics of Regular NUMA. Memory is placed local to the processes accessing the data in order to improve performance. If the computation or its data requirements exceed the capacity of a single NUMA node, then data and processing need to be spread out: interleaving spreads memory structures across nodes, and access to the data is balanced over the available processors on multiple nodes. This is automated by the NUMA autobalancer in newer Linux kernels; otherwise manual control of affinities and memory policies is necessary for proper performance (see the interleaving sketch below).
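
As a concrete illustration of interleaving (again a hedged sketch rather than code from the talk), libnuma can spread a large shared structure round-robin across all configured nodes so that no single memory controller becomes the bottleneck. The 1 GiB size is an arbitrary example.

    /* Sketch: interleave a large shared table across all NUMA nodes (link with -lnuma). */
    #include <numa.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        size_t len = 1UL << 30;                 /* 1 GiB shared table */

        /* Pages are distributed round-robin over all configured nodes. */
        void *table = numa_alloc_interleaved(len);
        if (!table)
            return 1;

        memset(table, 0, len);                  /* first touch places the pages */
        numa_free(table, len);
        return 0;
    }

numactl --interleave=all ./app applies the same policy to an unmodified program.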

A minimal and typical 2-node NUMA system. [Diagram: two nodes, each with local memory, connected by an interconnect.]

Regular NUMA use cases. Compensating for memory latency effects due to some memory being closer to the processor than other memory. Useful for scaling the VM to large amounts of memory: memory reclaim and other processing occur locally on a node, concurrently with the same actions on other nodes. I/O optimizations due to device NUMA effects. Works best with a symmetric setup: equal amounts of memory in each node and an equal number of processors. Linux is optimized for regular NUMA and works quite well with a symmetric configuration (a topology-inspection sketch follows below).
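
How symmetric a machine actually is can be checked from user space. The sketch below (an illustration, not material from the slides) prints each node's memory size and the node distance table that the kernel exposes through libnuma.

    /* Sketch: inspect the NUMA topology to judge how symmetric the machine is
     * (link with -lnuma). */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        int max = numa_max_node();

        for (int n = 0; n <= max; n++) {
            long long free_bytes;
            long long size = numa_node_size64(n, &free_bytes);
            printf("node %d: %lld MiB total, %lld MiB free\n",
                   n, size >> 20, free_bytes >> 20);
        }

        /* Distance matrix: 10 means local, larger values mean more remote. */
        for (int a = 0; a <= max; a++) {
            for (int b = 0; b <= max; b++)
                printf("%4d", numa_distance(a, b));
            printf("\n");
        }
        return 0;
    }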

Exotic NUMA configurations. And then people started trying to use NUMA for different things, which resulted in additional challenges and, IMHO, strange unintended effects. Memory nodes: add memory capacity without having to pay for expensive processors. Memoryless nodes: add processing capacity without memory. Unbalanced node setups (some nodes have lots of memory, some nodes have more processors than others, etc.). Central large memory node: applications all allocate from it; only the processors are grouped into NUMA nodes. Special high-speed nodes for data that needs to be available fast. Memory segmentation for resource allocation (Google, a precursor to cgroups). Memory hot add and hot remove.

A memory farm to expand system capacity. Nodes not populated with processors expand the amount of memory the system supports. Reclaim and memory management have to be done remotely. Allocations can be targeted at the memory nodes (see the mbind() sketch below). This approach has been used to support extremely large memory capacities. [Diagram: Node 1 through Node 4, memory only, attached to the interconnect.]
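
Targeting a memory-only node can be done with the raw mbind() system call on a mapped region. In the sketch below, node 2 is merely a placeholder for a CPU-less memory-farm node; adjust the node mask to the real topology.

    /* Sketch: bind an anonymous mapping to an assumed memory-only node 2
     * (link with -lnuma for the mbind() wrapper). */
    #include <numaif.h>
    #include <sys/mman.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 256UL << 20;               /* 256 MiB */
        int mem_node = 2;                       /* assumed memory-only node */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        unsigned long nodemask = 1UL << mem_node;   /* allow only this node */
        if (mbind(p, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, MPOL_MF_STRICT) != 0)
            return 1;

        memset(p, 0, len);                      /* pages now come from node 2 */
        munmap(p, len);
        return 0;
    }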

Memoryless nodes. Processors without memory increase the processing power, but this is ineffective since there is no local memory. [Diagram: a processor-only node next to Node 1 with memory.]

Big memory node configuration. One node holds most of the available memory; lots of nodes provide processors. This avoids NUMA configuration in applications: everything lives on one large node, with perhaps a small amount of memory on each processor node for efficiency. Needs a specialized memory configuration (a preferred-policy sketch follows below). [Diagram: Node 1 holding the large memory.]
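
One way to approximate "everything on one large node" is a process-wide preferred memory policy pointing at the big node, so that ordinary allocations default to it. A sketch under the assumption that node 0 is the large memory node:

    /* Sketch: prefer the big memory node (assumed to be node 0) for all
     * subsequent allocations of this process (link with -lnuma). */
    #include <numa.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        int big_node = 0;               /* assumed location of the large memory */

        /* New pages faulted in by this process are taken from node 0 first,
         * falling back to other nodes only when it is full. */
        numa_set_preferred(big_node);

        size_t len = 512UL << 20;
        char *work = malloc(len);       /* plain malloc; its pages follow the policy */
        if (!work)
            return 1;
        memset(work, 0, len);

        free(work);
        return 0;
    }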

Small fast node on the side. For SRAM: small capacity, no local processors, and the node is unused at boot. Default allocations occur from node 0; the user manually requests allocation on the SRAM node via memory policies or numactl. There is a danger of OS allocations landing on the node, so bring the system up with the fast node offline and online it once the system is running; that way only minimal OS data structures will be placed there. [Diagram: an SRAM node attached to the interconnect next to Node 1.]

NVRAM (future?). A node with information that survives the reboot process. The node has no processors and is brought up after boot is complete. The problem is avoiding OS initialization of it on bootup; perhaps a small area is used for bootup and the rest could be exposed as a raw memory device? [Diagram: an NVRAM node attached to the interconnect next to Node 1.] A sketch of onlining such a node late follows below.
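
Both the SRAM and the NVRAM scenario rely on leaving the special node offline at boot and onlining its memory once the system is running. A rough sketch of doing that through the memory-hotplug sysfs interface is shown below; it assumes root privileges, a kernel built with memory hotplug support, and that the special memory shows up as node 2.

    /* Sketch: online the memory blocks of a node that was left offline at boot,
     * via /sys/devices/system/memory. Node 2 is a placeholder. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    static int online_node_memory(int node)
    {
        char dirpath[128];
        snprintf(dirpath, sizeof(dirpath),
                 "/sys/devices/system/node/node%d", node);

        DIR *d = opendir(dirpath);
        if (!d)
            return -1;

        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            /* The node directory contains memoryX links for its memory blocks. */
            if (strncmp(e->d_name, "memory", 6) != 0)
                continue;

            char state[512];
            snprintf(state, sizeof(state),
                     "/sys/devices/system/memory/%s/state", e->d_name);

            FILE *f = fopen(state, "w");
            if (!f)
                continue;
            /* "online_movable" keeps kernel data off the node; plain "online"
             * also works where movable onlining is not possible. */
            fputs("online_movable", f);
            fclose(f);
        }
        closedir(d);
        return 0;
    }

    int main(void)
    {
        return online_node_memory(2) ? 1 : 0;   /* node 2 is an assumption */
    }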

Conclusion. Questions? Suggestions? Advice? Ideas?