NUMA in High-Level Languages. Patrick Siegler Non-Uniform Memory Architectures Hasso-Plattner-Institut

Similar documents
JVM Troubleshooting MOOC: Troubleshooting Memory Issues in Java Applications

Garbage Collection. Hwansoo Han

Low latency & Mechanical Sympathy: Issues and solutions

The G1 GC in JDK 9. Erik Duveblad Senior Member of Technical Staf Oracle JVM GC Team October, 2017

The Z Garbage Collector An Introduction

JVM Memory Model and GC

Agenda. CSE P 501 Compilers. Java Implementation Overview. JVM Architecture. JVM Runtime Data Areas (1) JVM Data Types. CSE P 501 Su04 T-1

2011 Oracle Corporation and Affiliates. Do not re-distribute!

Managed runtimes & garbage collection. CSE 6341 Some slides by Kathryn McKinley

The Z Garbage Collector Low Latency GC for OpenJDK

Managed runtimes & garbage collection

NG2C: Pretenuring Garbage Collection with Dynamic Generations for HotSpot Big Data Applications

Hierarchical PLABs, CLABs, TLABs in Hotspot

DNWSH - Version: 2.3..NET Performance and Debugging Workshop

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

Optimising Multicore JVMs. Khaled Alnowaiser

The C4 Collector. Or: the Application memory wall will remain until compaction is solved. Gil Tene Balaji Iyengar Michael Wolf

Concurrent Garbage Collection

High Performance Managed Languages. Martin Thompson

The Z Garbage Collector Scalable Low-Latency GC in JDK 11

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

Pause-Less GC for Improving Java Responsiveness. Charlie Gracie IBM Senior Software charliegracie

Sustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java)

Java Performance Tuning and Optimization Student Guide

MEMORY MANAGEMENT HEAP, STACK AND GARBAGE COLLECTION

A new Mono GC. Paolo Molaro October 25, 2006

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering

Runtime. The optimized program is ready to run What sorts of facilities are available at runtime

Complex, concurrent software. Precision (no false positives) Find real bugs in real executions

JDK 9/10/11 and Garbage Collection

Java Internals. Frank Yellin Tim Lindholm JavaSoft

High Performance Managed Languages. Martin Thompson

CSE P 501 Compilers. Java Implementation JVMs, JITs &c Hal Perkins Winter /11/ Hal Perkins & UW CSE V-1

Running class Timing on Java HotSpot VM, 1

JamaicaVM Java for Embedded Realtime Systems

Kodewerk. Java Performance Services. The War on Latency. Reducing Dead Time Kirk Pepperdine Principle Kodewerk Ltd.

Heap Compression for Memory-Constrained Java

Boosting the Priority of Garbage: Scheduling Collection on Heterogeneous Multicore Processors

Lecture 13: Garbage Collection

Heckaton. SQL Server's Memory Optimized OLTP Engine

Garbage Collection Algorithms. Ganesh Bikshandi

Inside the Garbage Collector

THE TROUBLE WITH MEMORY

Adaptive Optimization using Hardware Performance Monitors. Master Thesis by Mathias Payer

Java Performance Tuning

Algorithms for Dynamic Memory Management (236780) Lecture 4. Lecturer: Erez Petrank

JVM Performance Study Comparing Java HotSpot to Azul Zing Using Red Hat JBoss Data Grid

Lecture 15 Garbage Collection

Compiler construction 2009

Myths and Realities: The Performance Impact of Garbage Collection

OS-caused Long JVM Pauses - Deep Dive and Solutions

Lesson 2 Dissecting Memory Problems

Project. there are a couple of 3 person teams. a new drop with new type checking is coming. regroup or see me or forever hold your peace

Threads SPL/2010 SPL/20 1

Java & Coherence Simon Cook - Sales Consultant, FMW for Financial Services

Normal and Exotic use cases. of NUMA features. in the Linux Kernel. Christopher Lameter, Ph.D. Collaboration Summit, Napa Valley, 2014

Garbage Collection (aka Automatic Memory Management) Douglas Q. Hawkins. Why?

Java Memory Management. Märt Bakhoff Java Fundamentals

JVM Performance Tuning with respect to Garbage Collection(GC) policies for WebSphere Application Server V6.1 - Part 1

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008

Shenandoah An ultra-low pause time Garbage Collector for OpenJDK. Christine H. Flood Roman Kennke

CS Computer Systems. Lecture 8: Free Memory Management

CS577 Modern Language Processors. Spring 2018 Lecture Garbage Collection

Garbage Collection. Steven R. Bagley

Shenandoah: An ultra-low pause time garbage collector for OpenJDK. Christine Flood Roman Kennke Principal Software Engineers Red Hat

Acknowledgements These slides are based on Kathryn McKinley s slides on garbage collection as well as E Christopher Lewis s slides

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

Kotlin/Native concurrency model. nikolay

Garbage Collection. Weiyuan Li

CMSC 330: Organization of Programming Languages

MODULE 1 JAVA PLATFORMS. Identifying Java Technology Product Groups

Garbage Collection. Akim D le, Etienne Renault, Roland Levillain. May 15, CCMP2 Garbage Collection May 15, / 35

Implementation Garbage Collection

Java Performance Tuning From A Garbage Collection Perspective. Nagendra Nagarajayya MDE

The Fundamentals of JVM Tuning

PERFORMANCE. Rene Damm Kim Steen Riber COPYRIGHT UNITY TECHNOLOGIES

Towards High Performance Processing in Modern Java-based Control Systems. Marek Misiowiec Wojciech Buczak, Mark Buttner CERN ICalepcs 2011

New Java performance developments: compilation and garbage collection

CS 475. Process = Address space + one thread of control Concurrent program = multiple threads of control

Fiji VM Safety Critical Java

Virtual Machine Design

Automatic NUMA Balancing. Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP

Contents. Created by: Raúl Castillo

Shenandoah: An ultra-low pause time garbage collector for OpenJDK. Christine Flood Principal Software Engineer Red Hat

Operating Systems CMPSCI 377 Spring Mark Corner University of Massachusetts Amherst

Run-Time Environments/Garbage Collection

SPECjbb2005. Alan Adamson, IBM Canada David Dagastine, Sun Microsystems Stefan Sarne, BEA Systems

Task-Aware Garbage Collection in a Multi-Tasking Virtual Machine

Process Characteristics

Non-uniform memory access (NUMA)

Software Speculative Multithreading for Java

Present but unreachable

Chapter 4 Java Language Fundamentals

September 15th, Finagle + Java. A love story (

Performance of Non-Moving Garbage Collectors. Hans-J. Boehm HP Labs

Exploiting the Behavior of Generational Garbage Collector

CMSC 330: Organization of Programming Languages

Roadmap. Handling large amount of data efficiently. Stable storage. Parallel dataflow. External memory algorithms and data structures

Transcription:

NUMA in High-Level Languages Non-Uniform Memory Architectures Hasso-Plattner-Institut

Agenda. Definition of High-Level Language 2. C# 3. Java 4. Summary

High-Level Language Interpreter, no directly machine executable format Platform Independence Automated Memory Management 2.2.24 Chart 3

GC - Short recap Traverse reference trees to find non-referenced objects More than one GC root possible Reclaim space by moving referenced objects together Generational GC many short-lived objects old objects collected less freqently 2.2.24 Chart 4

Concurrent GC Difficult on multi-threaded systems Modification of references during scanning Lock Contention around MM data structures References may be outdated Stop-the-World at some point 2.2.24 Chart 5

GC on NUMA Systems GC compacting is copying memory Expensive across nodes Runtime faces same problem as OS: Who is going to use which memory? Young objects likely to stay on node Abstraction conflict Programs do not want to care about hardware layout Association of Threads / Tasks to nodes relevant for performance 2.2.24 Chart 6

C# - First Mutli-Processing Approach Stop-the-World when needed 2.2.24 Chart 7 Program, badgc

C# - Multi-Processing Enhancements Young generation collected per-thread foreground Old generation collected concurrently background 2.2.24 Chart 8

C# - Multi-Processing Enhancements Server GC uses dedicated high-priority threads 2 GC threads and a dedicated heap space per logical processor 2.2.24 Chart 9 Jump, AllHelp

C# - Shared data access 4 2 8 6 4 2 local remote mixed mixed,pinned onenode 2.2.24 Chart MemTravels

C# - Cost of GC,2 no new objects,2 generating new objects,8,8,6,6,4,4,2,2 2 4 6 8 2 4 6 8 2 22 24 2 4 6 8 2 4 6 8 2 22 24

C# - Manually pinning,2 not pinned,2 pinned by ID,8,6,8,6 HT-cores are uneven,4,4,2,2 2 4 6 8 2 4 6 8 2 22 24 2 4 6 8 2 4 6 8 2 22 24,2,8,6,4,2 pinned to node evenly 2 4 6 8 2 4 6 8 2 22 24,2,8,6,4,2 pinned evenly hyperthreading 2 4 6 8 2 4 6 8 2 22 24

C# - Single instance vs Two instances,2 not pinned,8,6,4,2,2,8,6,4 2 4 6 8 2 4 6 8 2 22 24 pinned evenly,2,8,6,4,2 two instances, pinned to node 2 4 6 8 2 4 6 8 2 22 24,2 2 4 6 8 2 4 6 8 2 22 24

C# - Summary Windows Unpinned threads jump (away from their memory) Natively pinned threads increase performance by >5% Interconnect usage n/a on test system Linux Mono s GC seems to suffer from lock contention 2.2.24 Chart 4

Java Various virtual machines offer many GCs with varying levels of concurrency Thread-Local Allocation Buffers (TLABs) synchronization-free allocation no NUMA-awareness Parallel Scanvenger GC (not concurrent, -XX:+UseParallelGC) -XX:+UseNUMA since Java 6u2 (+4% in SPEC JBB 25) per-node regions page interleaving for old and permanent generation -XX:+UseNUMAInterleaving on Windows 2.2.24 Chart 5

Java - OS differences (per-thread data) 2,5 2,5 RMA/LMA >=.7 Linux 2,5 2,5 Windows No location awareness, All data on node,5,5 2 4 6 8 2 4 6 8 2 22 24 2 4 6 8 2 4 6 8 2 22 24 2,5 2,5 Linux UseNUMA RMA/LMA >=.7 2,5 2,5 Windows UseNUMAInterleaving,5,5 2 4 6 8 2 4 6 8 2 22 24 2 4 6 8 2 4 6 8 2 22 24

Java - manually pinning,4,2,8,6,4,2 not pinned 2 4 6 8 2 4 6 8 2 22 24,4,2,8,6,4,2 pinned by ID Linux: HT-cores are >2 2 4 6 8 2 4 6 8 2 22 24,4,2,8,6,4,2 pinned to node evenly 2 4 6 8 2 4 6 8 2 22 24,4,2,8,6,4,2 pinned evenly 2 4 6 8 2 4 6 8 2 22 24

Java - 8 nodes 3,5 3 2,5 2,5,5 Baseline 6 32 48 64 8 96 2 28 3,5 3 2,5 2,5,5 UseNUMA 6 32 48 64 8 96 2 28 3,5 3 2,5 2,5,5 UseNUMA, UseParallelOldGC 6 32 48 64 8 96 2 28

Java - Summary Windows Limited support, only interleaving old generation Unpinned threads jump Linux Threads jump less often numatop: >.7 with UseNUMA and each thread has own data 2.2.24 Chart 9

Summary No explicit API in high-level Potentially catastrophic consequences of choice of GC Newer GC mechanisms are NUMA-aware New allocations happen in a node-local buffer Old generations are interleaved between all nodes Only GC is optimized Varying support on different OSes Under-commiting allows GC to operate concurrently 2.2.24 Chart 2

Future and unexplored issues Improving OS schedulers will also apply to high-level Tracing and performance counters to determine which memory is used Actual low-level instruction flow and hyperthreading Actual layout of objects in memory esp. after compactation Sufficient size should reduce caching effects Different JVM implementations 2.2.24 Chart 2

Sources http://blogs.msdn.com/b/dotnet/archive/22/7/2/the-net-framework-4-5-includes-new-garbage-collectorenhancements-for-client-and-server-apps.aspx http://msdn.microsoft.com/en-gb/library/ee78788.aspx#workstation_and_server_garbage_collection http://msdn.microsoft.com/en-us/magazine/cc63552.aspx http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html www.azulsystems.com/sites/default/files//images/wp_pgc_zing_v5.pdf http://www.oracle.com/technetwork/java/javase/memorymanagement-whitepaper-525.pdf http://developer.amd.com/community/blog/2//27/a-new-numa-option-for-windows-on-the-hotspot-jvm/ https://blogs.oracle.com/jonthecollector/entry/help_for_the_numa_weary http://javabook.compuware.com/content/memory/how-garbage-collection-works.aspx 2.2.24 Chart 22