Cross-Layer Memory Management to Reduce DRAM Power Consumption


Cross-Layer Memory Management to Reduce DRAM Power Consumption. Michael Jantz, Assistant Professor, University of Tennessee, Knoxville. 1

Introduction. Assistant Professor at UT since August 2014. Before UT: PhD in Computer Science at KU (July 2014); intern at Intel Corporation (2012-2013). Research interests: Compilers (optimization, phase ordering), Operating Systems (kernel instrumentation, memory and power management), Runtime Systems (dynamic compilation, object mgmt.). Courses taught: Compilers (COSC 461), Discrete Structures (COSC 311). 2

Outline: Compiler Optimization Phase Ordering; Dynamic Compilation; Cross-Layer Memory Management: Motivation, Design, Experimental Evaluation; Future Directions; Conclusions. 3

Compiler Optimization Phase Ordering 4

Phase Ordering. Compiler optimizations operate in phases, and phases interact with each other. Phase ordering: different phase orderings produce different quality code. Problem: finding the best ordering for each function or program takes a very long time; iterative search is the most common technique. 5

Exploiting Phase Interactions. Our approach: identify and exploit phase interactions during the search. Major contributions: reduce exhaustive phase ordering search time; increase the applicability and effectiveness of individual optimization phases; improve phase ordering heuristics. Publications: LCTES '10 [1], CASES '10 [2], CASES '13 [3], S:P&E (Jan. '13) [4]. 6

Dynamic Compilation 7

Tradeoffs in Dynamic Compilation. Managed language applications (e.g., Java) are distributed as machine-independent code and require compilation at runtime. Dynamic compilation policies involve tradeoffs and can potentially slow down overall performance. Several factors must be considered when setting policy: compiling speed and quality of compiled code, execution frequency of individual methods, and availability of compilation resources. 8

Dynamic Compilation Strategies. Conducted multiple studies on how, when, and whether to compile program methods, using an industrial-grade Java VM (HotSpot). Major studies: performance potential of phase selection in dynamic compilers (VEE '13-A [6]); dynamic compilation strategy on modern machines (TACO, Dec. '13 [5]). 9

Cross-Layer Memory Management 10

A Collaborative Approach to Memory Management. Memory has become a significant player in power and performance, yet memory power management is challenging. We propose a collaborative approach between applications, the operating system, and hardware: applications communicate memory usage intent to the OS; the OS re-architects memory management to interpret application intent and manage memory over hardware units; the hardware communicates its layout to the OS to guide memory management decisions. 11

A Collaborative Approach to Memory Management. Implemented the framework by re-architecting a recent Linux kernel and conducted an experimental evaluation. Publications: VEE '13-B [7], Linux Symposium '14 [8], manuscript in submission [9]. 12

Why CPU and memory are the most significant players for power and performance. In servers, memory power can account for roughly 40% of total power [10]. Applications can direct CPU usage: threads may be affinitized to individual cores or migrated between cores, threads may be prioritized for task deadlines (with nice), and individual cores may be turned off when unused. Surprisingly, much of this flexibility does not exist for controlling memory. 13

Example Scenario. Consider a system running a database workload with 512 GB of DRAM. All memory is in use, but only 2% of pages (roughly 10 GB) are accessed frequently, and CPU utilization is low. How can power consumption be reduced? 14

Challenges in Managing Memory Power. Memory references have temporal and spatial variation. There are at least two levels of virtualization: virtual memory abstracts away application-level information, and physical memory is viewed as a single, contiguous array of storage. There is no way for agents to cooperate with the OS and with each other, and there is no established tuning methodology. 15

A Collaborative Approach. Our approach: enable applications to guide memory management. This requires collaboration between the application, OS, and hardware: an interface for communicating application intent to the OS, and the ability to keep track of which memory modules host which physical pages during memory management. To achieve this, we propose the following abstractions: colors and trays. 16

Communicating Application Intent with Colors. [Diagram: software intent → color → tray → memory allocation and freeing.] A color is a hint for how pages will be used. Colors are applied to sets of virtual pages that are alike, and attributes associated with each color express different types of distinctions: hot and cold pages (frequency of access), or pages belonging to data structures with different usage patterns. Colors allow applications to remain agnostic to lower-level details of memory management. 17

Power-Manageable Units Represented as Trays. [Diagram: software intent → color → tray → memory allocation and freeing.] A tray is a software structure containing sets of pages that constitute a power-manageable unit. This requires a mapping from physical addresses to power-manageable units; ACPI 5.0 defines the memory power state table (MPST) to expose this mapping. We re-architected a recent Linux kernel to perform memory management over trays. 18

[Diagram: the application colors virtual pages (V1 ... VN) to indicate which ranges will be hot, cold, or sequentially accessed; during physical memory allocation and recycling (pages P1 ... PN), the OS looks up the attribute associated with each virtual page's color; the memory topology (trays T0-T7 over modules M0-M7, behind memory controllers with channels CH0/CH1 on NUMA nodes 0 and 1) is represented in the OS using trays.] 19

Experimental Evaluation. Emulating NUMA APIs; memory prioritization for applications; reducing DRAM power consumption; power-saving potential of containerized memory management; localized allocation and recycling; exploiting generational garbage collection. 20

Automatic Cross-Layer Memory Management. Limitations of application guidance: there is little understanding of which colors or coloring hints will be most useful for existing workloads, and all colors and hints must be manually inserted. Our approach: integrate with profiling and analysis to automatically provide power / bandwidth management. Implemented using the HotSpot JVM: instrumentation and analysis to build a memory profile, and partitioning of live objects into separately colored regions. 21
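
As a rough illustration of the partitioning step described above, the sketch below shows one way allocation sites could be divided into hot and cold sets from a memory profile. This is a minimal sketch, not the actual HotSpot implementation: the class names, fields, and the accesses-per-byte metric are assumptions made only for this example.

import java.util.*;

// Hypothetical sketch of offline hot/cold partitioning of allocation sites.
// A "site" here is just (method, bytecode index) with profiled counters.
final class AllocationSite {
    final String method;        // e.g., a fully qualified method name
    final int bci;              // bytecode index of the allocation
    long bytesAllocated;        // from the profiling run
    long accessesObserved;      // sampled/instrumented accesses to its objects

    AllocationSite(String method, int bci, long bytes, long accesses) {
        this.method = method;
        this.bci = bci;
        this.bytesAllocated = bytes;
        this.accessesObserved = accesses;
    }

    // Accesses per allocated byte: a simple "hotness" metric for this sketch.
    double hotness() {
        return bytesAllocated == 0 ? 0.0 : (double) accessesObserved / bytesAllocated;
    }
}

final class OfflinePartitioner {
    // Marks the hottest sites (by accesses per byte) as "hot" until the chosen
    // fraction of total allocated bytes is covered; the remaining sites are cold.
    static Set<AllocationSite> partitionHot(List<AllocationSite> sites, double hotByteFraction) {
        long totalBytes = sites.stream().mapToLong(s -> s.bytesAllocated).sum();
        long budget = (long) (hotByteFraction * totalBytes);

        List<AllocationSite> sorted = new ArrayList<>(sites);
        sorted.sort(Comparator.comparingDouble(AllocationSite::hotness).reversed());

        Set<AllocationSite> hot = new HashSet<>();
        long covered = 0;
        for (AllocationSite s : sorted) {
            if (covered >= budget) break;
            hot.add(s);                    // objects from these sites go to hot heap regions
            covered += s.bytesAllocated;
        }
        return hot;
    }
}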

[Diagram: HotSpot architecture with the execution engine, JIT compiler, garbage collector, and object profiling and analysis feeding an application heap whose young generation (eden and survivor spaces) and tenured generation are each split into hot and cold spaces.] We employ the default HotSpot configuration for server-class applications and divide the survivor / tenured spaces into spaces for hot / cold objects. 22

[Diagram: same heap organization as the previous slide.] Color spaces on creation or resize; partition allocation sites and objects into hot / cold sets. 23

Potential of the JVM Framework. Our goal: evaluate the power-saving potential when hot / cold objects are known statically. MemBench is a Java benchmark that uses different object types, HotObject and ColdObject, for hot / cold memory; both contain memory resources (an array of integers) and implement different functions for accessing that memory. 24
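
The slides name the two object types but not their structure. Purely as an illustration, here is a minimal sketch of what such classes might look like; the interface, method names, and field layout below are assumptions, not the actual MemBench source.

// Sketch of MemBench-style object types (the names HotObject/ColdObject come
// from the slides; the shared interface and array field are assumptions).
interface MemObject {
    long touch();   // read the object's memory; only hot objects are touched in the benchmark
}

final class HotObject implements MemObject {
    private final int[] data;

    HotObject(int sizeInts) { this.data = new int[sizeInts]; }

    @Override
    public long touch() {
        long sum = 0;
        for (int v : data) sum += v;   // stream through the backing array
        return sum;
    }
}

final class ColdObject implements MemObject {
    private final int[] data;          // allocated but never read during the access phase

    ColdObject(int sizeInts) { this.data = new int[sizeInts]; }

    @Override
    public long touch() { return 0; }  // cold objects are not accessed after allocation
}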

Experimental Platform. Hardware: a single node of a 2-socket server machine; processor: Intel Xeon E5-2620 (12 threads @ 2.1 GHz); memory: 32 GB DDR3 (four DIMMs, each connected to its own channel). Operating system: CentOS 6.5 with Linux 2.6.32. JVM: HotSpot v. 1.6.0_24, 64-bit, in the default configuration for server-class applications. 25

The MemBench Benchmark. Object allocation: creates HotObject and ColdObject objects in a large in-memory array; hot objects are outnumbered by cold objects (~15% of all objects are hot); the object array occupies most (~90%) of system memory. Multi-threaded object access: the object array is divided into 12 separate parts, each passed to its own thread; threads iterate over the object array, accessing only the hot objects; an optional delay parameter controls the time between accesses. 26

MemBench Configurations. Three configurations: default; tray-based kernel (custom kernel, default HotSpot); hot/cold organize (custom kernel, custom HotSpot). The delay is varied from "no delay" to 1000 ns; with no delay, there are 85 ns between memory accesses. 27
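
The access phase and the delay knob can be sketched as below, reusing the MemObject / HotObject types from the previous sketch. This is a simplified, hypothetical re-creation of the workload described above, not the actual MemBench code; the thread-pool partitioning and the busy-wait used to realize the delay are assumptions for illustration.

import java.util.concurrent.*;

// Simplified sketch of the MemBench access phase described in the slides:
// split the object array across worker threads, touch only hot objects,
// and optionally busy-wait between accesses.
final class MemBenchAccess {
    static void run(MemObject[] objects, int numThreads, long delayNanos)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        int chunk = (objects.length + numThreads - 1) / numThreads;

        for (int t = 0; t < numThreads; t++) {
            final int begin = t * chunk;
            final int end = Math.min(objects.length, begin + chunk);
            pool.submit(() -> {
                long sink = 0;
                for (int i = begin; i < end; i++) {
                    if (objects[i] instanceof HotObject) {     // only hot objects are read
                        sink += objects[i].touch();
                        if (delayNanos > 0) spin(delayNanos);  // optional delay between accesses
                    }
                }
                return sink;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Busy-wait so the configured delay does not involve the scheduler.
    private static void spin(long nanos) {
        long end = System.nanoTime() + nanos;
        while (System.nanoTime() < end) { /* spin */ }
    }
}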

MemBench Performance. [Chart: runtime relative to default, P(X)/P(DEF), and bandwidth (GB/s) vs. time between memory accesses (85 ns to 1000 ns) for the default, tray-based kernel, and hot/cold organize configurations.] The tray-based kernel has about the same performance as the default; hot/cold organize exhibits poor performance with low delay. 28

MemBench Bandwidth. [Chart: same axes and configurations as the previous slide.] The default and tray-based kernel configurations produce high memory bandwidth when the delay is low; placement of hot objects across multiple channels enables higher bandwidth. 29

MemBench Bandwidth. [Chart: same axes and configurations as the previous slide.] Hot/cold organize co-locates hot objects on a single channel; increasing the delay reduces the bandwidth requirements of the workload. 30

MemBench Energy. [Chart: energy consumed relative to default, J(X)/J(DEF), vs. time between memory accesses for tray-based kernel (DRAM only), tray-based kernel (CPU+DRAM), hot/cold organize (DRAM only), and hot/cold organize (CPU+DRAM).] Hot/cold organize consumes much less power with low delay; even when bandwidth requirements are reduced, hot/cold organize consumes less power than the other configurations. 31

MemBench Energy. [Chart: same axes and configurations as the previous slide.] Significant energy-savings potential with the custom JVM: maximum DRAM energy savings of ~39% and maximum CPU+DRAM energy savings of ~15%. 32

Results Summary. Object partitioning strategies: an offline approach partitions allocation points, and an online approach uses sampling to predict object access patterns. Evaluated with standard sets of benchmarks (DaCapo, SciMark): achieve 10% average DRAM energy savings and a 2.8% CPU+DRAM reduction, with performance overheads of 2.2% for offline and 5% for online. 33
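
The online approach is only summarized above. The sketch below shows one plausible shape for it: sample a small fraction of object accesses, then periodically re-classify allocation sites whose sampled counts cross a threshold. The class, method names, and thresholding scheme are assumptions for illustration, not the actual HotSpot implementation.

import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of online sampling: record sampled accesses per
// allocation site and periodically promote/demote sites between hot and cold.
final class OnlineSampler {
    private final Map<String, Long> sampledAccesses = new ConcurrentHashMap<>();
    private final Set<String> hotSites = ConcurrentHashMap.newKeySet();

    // Called from a small, sampled subset of object accesses.
    void recordSample(String allocationSite) {
        sampledAccesses.merge(allocationSite, 1L, Long::sum);
    }

    // Called at the end of each sampling interval (e.g., at GC points).
    void reclassify(long hotThreshold) {
        for (Map.Entry<String, Long> e : sampledAccesses.entrySet()) {
            if (e.getValue() >= hotThreshold) {
                hotSites.add(e.getKey());      // future allocations from this site go to hot spaces
            } else {
                hotSites.remove(e.getKey());   // demote sites that have cooled off
            }
        }
        sampledAccesses.clear();               // start a fresh interval
    }

    boolean isHot(String allocationSite) {
        return hotSites.contains(allocationSite);
    }
}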

Current and Future Projects in Cross-Layer Memory Management. Immediate future work: address the performance losses of our current approach and improve the online sampling. Other directions: automatic bandwidth management, applications for heterogeneous memory architectures, and exploiting data object placement within each page to improve efficiency. 34

Conclusions. My research focuses on software systems: compilers, operating systems, and runtime systems. Cross-layer memory management: achieving power/performance efficiency in memory requires a cross-layer approach. This is the first framework to use the usage patterns of application objects to steer low-level memory management; the approach shows promise for reducing DRAM energy and opens several avenues for future research in collaborative memory management. 35

Questions? 36

References
1. Prasad Kulkarni, Michael Jantz, and David Whalley. Improving Both the Performance Benefits and Speed of Optimization Phase Sequence Searches. In the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '10), April 2010.
2. Michael Jantz and Prasad Kulkarni. Eliminating False Phase Interactions to Reduce Optimization Phase Order Search Space. In the ACM/IEEE International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES '10), October 24-29, 2010.
3. Michael Jantz and Prasad Kulkarni. Exploiting Phase Inter-Dependencies for Faster Iterative Compiler Optimization Phase Order Searches. In the ACM/IEEE International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES '13), September 29 - October 4, 2013.
4. Michael Jantz and Prasad Kulkarni. Analyzing and Addressing False Phase Interactions During Compiler Optimization Phase Ordering. In Software: Practice and Experience, January 2013.
5. Michael Jantz and Prasad Kulkarni. Exploring Single and Multi-Level JIT Compilation Policy for Modern Machines. In ACM Transactions on Architecture and Code Optimization (TACO), December 2013.
6. Michael Jantz and Prasad Kulkarni. Performance Potential of Optimization Phase Selection During Dynamic JIT Compilation. In the ACM SIGPLAN Conference on Virtual Execution Environments (VEE '13), March 16-17, 2013. 37

References
7. Michael Jantz, Carl Strickland, Karthik Kumar, Martin Dimitrov, and Kshitij A. Doshi. A Framework for Application Guidance in Virtual Memory Systems. In the ACM SIGPLAN Conference on Virtual Execution Environments (VEE '13), March 16-17, 2013.
8. Michael Jantz, Kshitij Doshi, Prasad Kulkarni, and Heechul Yun. Leveraging MPST in Linux with Application Guidance to Achieve Power-Performance Goals. In the Linux Symposium, Ottawa, Canada, May 2014.
9. Michael Jantz, Forrest Robinson, Prasad Kulkarni, and Kshitij Doshi. Cross-Layer Memory Management for Managed Language Applications. In submission, July 2015.
10. C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller. Energy Management for Commercial Servers. Computer, 36(12):39-48, December 2003. 38